Tuesday, 26 January 2016

Using Gephi to determine trending subjects in libraries

Using Gephi to create a network of the subjects that were exposed to users of the OPAC of the Peace Palace Library during December 2015, I am able to observe clusters of related subjects. At the top level of this network I see one huge cluster with the subject “Human rights” more or less at its center. Zooming in on this cluster enables me to weed out the subjects that are not strongly related to “Human rights”. After several rounds of zooming in, one smaller cluster with “Human rights” at its center remains, in which Gephi no longer distinguishes any subsections. The table below therefore contains a set of more strongly related subjects, centered on the main subject “Human rights”, in December 2015.



Slope     Label

-1.385    Human rights

-0.1599   Freedom of expression
-0.0741   Civil society
-0.0678   Human dignity
-0.0203   Oceania
-0.0171   Thailand
-0.0106   World Bank Group
-0.0102   Regional instruments
-0.0066   Blasphemy
-0.0036   Universal Declaration of Human Rights (New York, 10 December 1948)
-0.0033   Committee on the Elimination of Racial Discrimination
-0.0025   Human rights commissions
-0.0015   International Convention on the Elimination of All Forms of Racial Discrimination (New York, 7 March 1966)
-0.0012   Criminalization
-0.0009   Office of the United Nations High Commissioner for Human Rights

+0.2683   International instruments
+0.1403   Obligations of the state
+0.1362   Education
+0.1334   Civil and political rights
+0.0489   International law and domestic law
+0.045    Human Rights Committee
+0.0413   Legal remedies
+0.0409   International Covenant on Civil and Political Rights (New York, 16 December 1966)
+0.0202   Freedom of information
+0.0173   Hate speech
+0.0129   Reporting
+0.0078   Accountability
+0.0076   Committees
+0.0036   United Nations Human Rights Council
+0.0023   Treaty bodies
+0.0004   Freedom to provide education

How was this set of interrelated subjects used in the other months of 2015? Is it possible to determine a trend over the whole year? To investigate this, I collected a lot of subject-related data. For every month of 2015 I collected all subjects exposed to the users of our OPAC and then let Gephi calculate several values for each subject in these monthly networks. I stored all these values in a database, but in what follows I am only interested in the so-called betweenness centrality value. To quote an earlier blog: “In brief, betweenness centrality is an indicator value for a key position. The higher the value the more important the role of the keyword. This value is calculated by counting the shortest paths between two keywords in our network. The keyword which appears the most times as being in between two different keywords, has the highest betweenness centrality value; these keywords are brokers or intermediaries.” [http://pushaqa.blogspot.nl/2014/11/just-below-surface.html]

So it is possible to measure the popularity of certain subject areas by using the 'weight' of these subject areas in the bigger picture of the monthly subject networks. This means I used the values calculated in relation to the complete sets of subjects, not just the values of the subset. I then calculated the trendline of these weights. The slope of the trendline indicates an increase (positive slope) or a decrease (negative slope) in popularity. The table of subjects above lists the slope for each subject. It shows a division in popularity: some subjects are decreasing in popularity, like “Human rights” itself or “Freedom of expression”, while increasing popularity can be observed for “International instruments” and “Obligations of the state”. Taking the complete set of subjects in the table into consideration, the average slope is -0.027, so there is a very slight decrease of interest in “Human rights and related subjects”.
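As a minimal sketch of the calculation described above, assuming the trendline is an ordinary least-squares line through the monthly values (the post does not say which fitting method was used), the slope per subject and the average of the slopes could be computed as follows. The function name `trend_slope` and the values in `example_months` are hypothetical; the slopes in `TABLE_SLOPES` are copied from the table above.

```python
def trend_slope(values):
    """Least-squares slope of values plotted against month numbers 1..n."""
    n = len(values)
    months = range(1, n + 1)
    mean_x = sum(months) / n
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, values))
    denominator = sum((x - mean_x) ** 2 for x in months)
    return numerator / denominator

# Hypothetical monthly betweenness values for one subject (not real data).
example_months = [1.0, 2.0, 3.0]
example_slope = trend_slope(example_months)  # rising values give a positive slope

# Slopes as listed in the table above (main subject, decreasing, increasing).
TABLE_SLOPES = [
    -1.385,
    -0.1599, -0.0741, -0.0678, -0.0203, -0.0171, -0.0106, -0.0102,
    -0.0066, -0.0036, -0.0033, -0.0025, -0.0015, -0.0012, -0.0009,
    0.2683, 0.1403, 0.1362, 0.1334, 0.0489, 0.045, 0.0413, 0.0409,
    0.0202, 0.0173, 0.0129, 0.0078, 0.0076, 0.0036, 0.0023, 0.0004,
]

# Average slope over the whole set, rounded to three decimals.
average_slope = round(sum(TABLE_SLOPES) / len(TABLE_SLOPES), 3)
print(average_slope)
```

The average over the 31 listed slopes reproduces the -0.027 mentioned in the text, which suggests the reported figure is a plain arithmetic mean of the table.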

There is an increase in interest if I do the same exercise with "Law of the sea": an average slope of 0.142.

It is my belief that knowledge about the development of interest in a particular subject can help libraries to create better services for their users.


Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Tuesday, 20 October 2015

Zooming in on the outliers.

I have regularly written here about keywords and their role in keyword networks. I showed charts of networks based on the bridging function that keywords can fulfil in a network (betweenness), on the presumed influence of a keyword in a network (eigenvector), on keyword manifestations in the OPC, Plinklets, etc.

The idea is that keywords with a high betweenness and/or eigenvector value play a, let's say, more important role in the keyword network. This seems to be confirmed by the rough causal relationship that can be demonstrated between the two values. Without looking at an image of such a network we already know that geographical keywords will, by their nature, play such a role. That is because this kind of keyword can turn up practically anywhere: Netherlands and piracy, Netherlands and family law, Netherlands and terrorism. Within a network of interrelated subjects the geographical terms actually pollute the picture or, put differently, they are of a different order. In what follows I have therefore filtered out the geographical keywords. Moreover, I initially restrict myself to data from the month of August.

The question "Which keywords have a relatively high betweenness and eigenvector value?" is fairly easy to answer with the help of the programs Gephi and R. I said earlier that there is a causal relationship between the betweenness and eigenvector values: a keyword with a high eigenvector value also has a higher betweenness value, and vice versa. The ratio between the two can differ per keyword, though. If you plot both values in a chart, you therefore see an imaginary line running between the keywords from roughly bottom left to top right. The data below include only the higher betweenness and eigenvector values, but not the very highest, such as those of Human rights or European Union. Including all values would yield a completely saturated chart, because then the keywords with very low values turn up as well.


On the left, the y-axis shows numerical values that are not realistic: I stretched the values by a factor of 8 to spread the points more evenly across the plot area. The x-axis uses scientific notation for very low eigenvector values. Eigenvector values are always expressed as very low numbers, hence the notation. Each point is a keyword.

If I now tell R that we must identify three clusters based on the eigenvector values, the result is the following picture.



So R groups the keywords as shown. Looking at the first chart, a human would perhaps recognize more clusters, but I explicitly specified that we work with three clusters. Even so, we can already conclude that there are more keywords with lower eigenvector and betweenness values (black) than keywords with higher values. That is not surprising, of course. If we do the same, but with the betweenness values, it looks like this.
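The post does not say which R function produced the three clusters. As a hedged illustration only, assuming a k-means style grouping on a single value per keyword, the idea could be sketched like this. The function `kmeans_1d` and the eigenvector values are hypothetical.

```python
def kmeans_1d(values, k=3, iterations=100):
    """Group one-dimensional values into k clusters with plain k-means."""
    ordered = sorted(values)
    # Deterministic start: pick k centres spread evenly over the sorted values.
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for value in values:
            nearest = min(range(k), key=lambda i: abs(value - centres[i]))
            clusters[nearest].append(value)
        # Move each centre to the mean of its cluster (keep it if the cluster is empty).
        new_centres = [
            sum(cluster) / len(cluster) if cluster else centres[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centres == centres:
            break
        centres = new_centres
    labels = [min(range(k), key=lambda i: abs(v - centres[i])) for v in values]
    return labels, centres

# Hypothetical eigenvector values for six keywords (not real data).
eigenvector_values = [0.001, 0.002, 0.003, 0.02, 0.021, 0.05]
labels, centres = kmeans_1d(eigenvector_values, k=3)
print(labels)
```

Because the number of clusters is fixed in advance, the result always contains three groups, even where a human reader might see more or fewer in the chart.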

It would take a separate study to compare the clusters with each other and, for example, to look at the overlap between the green cluster above and the red cluster in the chart above that. But it is of course also possible to see how the clusters are coloured when the numbers of manifestations of the keywords are taken as the guide for the clustering. That is what happens below.

The position of a keyword in the chart is thus determined by the two values, its colour by the count. And then the bottom left suddenly becomes interesting, because it is mainly there that a certain mixing occurs. Red = 300-1000, green = 1000-1250 and black = 1250-3000 manifestations. The green dots on the far left, which stand out given their numbers of manifestations in relation to their position in the chart, are the following keywords (with * marking the keywords that always turn up) and their counts:

August:
Foreign direct investment - 1158, *Public international law - 1138, *Private International Law - 1127, *International criminal law - 1022, *International law - 1005
Who would have thought? A relatively low bridging function and a relatively low influence on its surroundings: what, then, explains the role of the keyword 'Foreign direct investment' in August 2015?

The situation in July:
*International law - 1488, *Public international law - 1425, *United Nations - 1216

In June:
*International criminal law - 960, *International humanitarian law - 743, *Public international law - 739, Environmental protection - 626

In May:
*International criminal law - 1047, *International humanitarian law - 906, *United Nations - 831, *International law - 795

In February:
*International criminal law - 878, *International law - 794, *International humanitarian law - 724, Terrorism - 608 (Hebdo?), *United Nations - 591, Environmental protection - 507

In January Terrorism only just appears in the chart, and this continues in February (Hebdo?). Furthermore, in two months we see the keyword 'Environmental protection' turn up as a striking manifestation. Not the most important one, but an important, eye-catching manifestation nonetheless.

The question now is whether a monthly 'interest profile' can be compiled on the basis of the above. And what would that mean for the library?

Thursday, 16 July 2015

Some thoughts about subjects

[I wrote an internal memo which I would like to share on this platform, although some statements were previously published in earlier blogs]

Nowadays libraries operate in a time of tremendous change. The familiar financial foundation of every library has been removed and replaced by a much weaker one. The search expertise of library users increasingly reflects the search methodology of a Google or Bing search. Libraries associated with universities and other research facilities in particular strongly present themselves as participants in research, as suppliers and managers of data. And last but certainly not least, the type of collection offered by libraries is rapidly changing, from paper to electronic files made available in any form whatsoever, which contributes to difficult technicalities and a legal world that could be described as a world of quicksand.

In this hectic world of budget cuts, it is of the utmost importance for libraries to present themselves and their collections clearly to their current and future users. There are many ways to do so: clear websites, simple but solid library software, being topical and up to date, being where your users are (Facebook, Twitter), connecting to users through newsletters and alerting systems, etc. Less obvious is bringing parts of the collection into the limelight, including the 'old-fashioned' parts: books and journals. The library of the Peace Palace is one of the libraries that try to draw attention to specific parts of their collections. On a regular basis specific components, called research guides, with current and relevant bibliographical data, are placed in the foreground.

Libraries are also adding subject headings to the standard metadata of their documents, thus enriching the collections they manage. With this extra metadata users are able to locate relevant information in a more specific way. Unfortunately, this effort is not fully used by the patrons of the library. Only a very small percentage of OPAC queries use subject indices, and the users who do so hardly ever combine different subject headings. So how can we increase the 'return on this investment'? I think the supposed disinterest of our users can be attributed to ignorance; most of them simply don't know subject headings exist, or at least don't know what can be done with them. I'll give an example to show what I mean. In our search log I detected two different users, both searching with the simple word 'genocide'. Both switched the search index from 'title words' to 'all words', so both knew how to use the index system of our library software, but neither of them bothered to search using the 'subject headings' index, which of course gives a more reliable outcome.

You can try to change this behaviour by simple instruction and/or by showing how our subject headings appear in the results of a search. Not by showing how users embed subject headings in their searches (this is hardly done, as I said) but by showing which subject headings appear in a set of results generated by more common search types. I decided to try the latter, so in trying to explain why using subject headings is important, I actually start at the end of this route, not the beginning. The most informative and still compact method of presenting this kind of data is an interactive map.

The software that makes this possible is Gephi, an open source program and therefore freely available. Gephi is usually used to visualize strong or even weak relations between persons or websites, but I thought it could work with subject headings too. Simply imagine there is a strong relation between the subject headings in the metadata belonging to one document and a weaker relation between the same subject headings belonging to different documents.

There is a double benefit when a larger set of results is collected for use in Gephi. Not only are the subject headings that are more or less strongly related to one another shown, the different subject areas, huge and small, can be visualized as well. I decided to collect all titles viewed in our OPAC in June 2015. Almost all titles had subject headings, and these subject headings were stored in a file that Gephi can process. All in all I collected 2900 different subject headings in use (nodes) and 72500 different relations (edges) between these subject headings, each with its own weight. (This is not the place to explain the intricacies of Gephi, but if you really want to know, please search for 'Gephi' on the Internet. There is a lot of information available.)

Creating maps with Gephi is one thing, but making them available on the internet is another. Luckily Gephi allows users to create plugins, which can be used to create different layouts and statistical or relational models. It is also possible to create plugins that export the maps and the building blocks of these maps. The Oxford Internet Institute: http://www.oii.ox.ac.uk/ (University of Oxford) together with JISC: http://www.jisc.ac.uk/about created such a plugin, with which it is possible to export the relevant data and scripts using just JavaScript. So all browsers with JavaScript enabled are able to present clickable maps, no browser extensions needed.

In short, after clicking the link mentioned below, you will see smaller clusters of subject headings indicating interest in more specific subject areas like 'Environment' or 'Nato and Ethics', but also some huge clusters referring to more general subjects like 'Human Rights' or 'European Union'. You can zoom in and out using the little zoom toolbar below the map, or select one cluster for more detailed inspection using the Group Selector (to the left). Clicking one occurrence in the map gives a lot of information about the chosen subject heading, such as detailed statistical information about its strength or weight and the other subject headings with which it was combined (popup to the right). This shows which subject headings were combined to describe the contents of different but related publications, and gives a hint for searching with combined subject headings using the restrict[] option in our OPAC.

Please visit http://www.peacepalacelibrary.nl/june to see and use the map which gives an overview of the data mentioned above. You need more information? Questions? Contact Aad Janson at a.janson at ppl dot nl

Thursday, 11 December 2014

Country profiles and OPAC use.

The library of the Peace Palace serves a global community. I know it is global because I can see this in the standard logging of the website. Every use of our website pages ends up in a log file, and every line in this log file contains the IP address of the visitor using that specific page. This IP address can be translated to a country of origin.

And that is what I did in my blog "MOOC: learning and instruction: Tableau and library use". In that blog I presented several maps, one of them dealing with the use of our 'human rights' website pages. The map is to the left (sorry about Alaska). One might say that the use of these pages is at least partly prompted by searching the Internet, and indeed that is usually the case. Only a handful of people will have added the library of the Peace Palace to their bookmarks.

Of course it is possible to follow users as they move around our website, but, in general and in truth, they mostly leave short tracks. I even think these tracks are too short to draw any well-founded conclusions about what exactly our website users are looking for. It is a lot easier to rise above the personal and to pay more attention to the geographical level. That means creating world maps, just to start with.

Next to the website, libraries also provide an OPAC, and normally these libraries have a web search interface to their collections. And our OPAC server, you guessed it, produces log files. These log files look quite different from the log files of regular web servers. Since the beginning of this week we have been collecting these files (thanks to OCLC, The Netherlands) in order to parse them, store the relevant data in a database and draw some conclusions. As I stated, these files are a bit more complicated than the web server log files I usually look at, but I'm sure that in the next couple of months I will be able to deal with them. Just a random example of one log entry:

#XXX.XXX.XXX.XX 60765 1418079639.497074 GET /DB=1/SET=2/TTL=1/CMD?ACT=SRCHA&IKT=4&SRT=YOP&TRM=population HTTP/1.1
Host: catalogue.ppl.nl
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36
Referer: http://catalogue.ppl.nl/DB=1/SET=2/TTL=1/NXT?FRST=16
Accept-Encoding: gzip, deflate, sdch
Accept-Language: fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4
Cookie: DB="1"; PSC_1="CMD_COLLAPSE%07N%07%08FKT%074%07%08FRM%07population%07%08IMPLAND%07Y%07%08LNG%07EN%07%08LRSET%072%07%08REFERER%07%07%08SET%072%07%08SID%07b6cadb8a-0%07%08SRT%07YOP%07%08TTL%071%07%08XSLBASE%07http%3A%2F%2Flbs-vrep.oclc.org%3A8282%2Foclc_gui%07%08XSLFILE%07%253Fid%253D$c%2526db%253D$d%07%08"

Where you see all the X's there was the IP address, which I replaced for obvious reasons. On the same line I see 'IKT', which indicates the index used for searching, and 'TRM' with the actual term searched for, 'population'. I also see that our user had already produced another set of results (SET=2, so there must be a previous SET=1). In the Cookie: header I also notice the session identifier (b6cadb8a-0), which I can use to reconstruct all the actions of our user. In summary, there are a lot of possibilities for collecting useful information: information the library can use to provide optimal and up-to-date information and services, like instruction, to its users.
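As a rough sketch of how the fields mentioned above could be pulled out of such an entry: the request line is copied from the example, while the cookie is an abridged version of it. The assumption that the PSC_1 cookie is a URL-encoded list of key/value fields separated by the control characters %07 and %08 is inferred from this one sample only, and the function names are hypothetical.

```python
from urllib.parse import parse_qs, unquote, urlsplit

REQUEST_LINE = (
    "GET /DB=1/SET=2/TTL=1/CMD?ACT=SRCHA&IKT=4&SRT=YOP&TRM=population HTTP/1.1"
)
# Abridged from the Cookie: header in the example entry.
COOKIE_HEADER = (
    'DB="1"; PSC_1="CMD_COLLAPSE%07N%07%08FKT%074%07%08FRM%07population%07%08'
    'SET%072%07%08SID%07b6cadb8a-0%07%08SRT%07YOP%07%08TTL%071%07%08"'
)

def parse_request_line(line):
    """Return (path_fields, query_fields) from an OPAC request line."""
    _method, target, _protocol = line.split(" ", 2)
    parts = urlsplit(target)
    # Path segments such as SET=2 carry state; CMD has no '=' and is skipped.
    path_fields = dict(
        segment.split("=", 1) for segment in parts.path.split("/") if "=" in segment
    )
    query_fields = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return path_fields, query_fields

def parse_session_cookie(header, name="PSC_1"):
    """Decode the key/value fields packed into the named cookie (assumed layout)."""
    fields = {}
    for cookie in header.split("; "):
        cookie_name, _, value = cookie.partition("=")
        if cookie_name != name:
            continue
        decoded = unquote(value.strip('"'))
        for chunk in decoded.split("\x08"):      # %08 separates fields
            pieces = chunk.split("\x07")         # %07 separates key and value
            if len(pieces) >= 2 and pieces[0]:
                fields[pieces[0]] = pieces[1]
    return fields

path_fields, query_fields = parse_request_line(REQUEST_LINE)
session_fields = parse_session_cookie(COOKIE_HEADER)
print(query_fields["IKT"], query_fields["TRM"], path_fields["SET"], session_fields["SID"])
```

Grouping parsed entries by the SID value would then give the sequence of actions within one session.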

To end this blog I will show two maps (made with Tableau) containing counts of successful searches in our OPAC during just two days, 8-9 December. One map shows the use of our OPAC in Europe and, not surprisingly, the Netherlands scores highest.


In the other map I had to leave out the Netherlands to create sufficient distinction in colour. This map contains the same data as the map above.


Wednesday, 3 December 2014

Presenting keywords. Eh? Which keywords? And how?

It is always hard to get started with bibliographic research, especially for those patrons and scholars who realize it is important not just to search by some words using a very general index, but to use keywords (sometimes also called subject headings). These users know that a very dedicated group of librarians has thoroughly examined the publications added to their OPAC and has enriched them with keywords. And therein lies a problem, because one might ask: "what do these keywords look like?".

In my last couple of blogs I focused on how to get an idea of subject areas (huge and small). I used Gephi to create maps which could indeed give an impression of subjects and subject areas. For technical reasons, however, these maps dealt with only a relatively small set of data. In the most recently published map I incorporated only about 4,500 titles exposed to our users in the reading room of the library. This may seem a lot, but in fact it is not. Smaller subjects may therefore not appear during this specific time frame, or may go unnoticed. To get a more thorough impression of the subjects and the keywords used in the library, one should use as large a set of information as possible. Luckily I have collected such a set, but before I come to that, I would like to sketch a situation.

So, suppose a patron is looking for publications about biological warfare and the Security Council (we are, after all, the Peace Palace Library). Chances are (about 70% of our users do so) that he uses the all words index in our OPAC and types 'biological warfare security council'. No hits. He then tries 'biological warfare' using the same index: 92 hits, which he quickly scans for what interests him (about 6 pages of title info). He then tries 'security council': 1,416 hits, which he does not scan, because it is too much. Now suppose he suddenly realizes he should also use the keyword index. Biological warfare: 213 hits. Security council: 2,741 hits. Combined (we just assume that he knows how to do this): 1 hit (a freely available PDF file containing references to other freely available texts and links to websites).

It is my opinion that libraries should bring to the forefront sets of keywords, all related to one general subject. Library users then just have to check these overviews to understand which keywords they should use in their research. To build up such a set it is important to collect just the keywords which were exposed to or used by our patrons, preferably over a longer time span. This way we can be sure that all relevant keywords are collected.

In January of this year I started to collect all the records that were, in one way or another, seen or used by our patrons and visitors. This includes the records 'seen' by search robots, like the ones from Google. The database (we use MongoDB as the database system) now contains almost 7,500,000 records, but keep in mind that less than 10% of these can actually be attributed to human beings. Each record contains a publication id, IP address, time stamp and the keywords belonging to the publication. Given the size of the database it is possible to collect the keywords that were really used or exposed in relation to just one general subject.

I managed to create such a set of keywords as an example. They all were used in some combination with my 'main' keyword Biological and chemical weapons. The set contains a little bit more than 410 different keywords, some used quite a lot, others just a few times. 
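The selection described above can be sketched as follows. This is a simplified illustration that works on an in-memory list rather than on the MongoDB collection itself; the field names, the sample records and the list of robot IP addresses are all hypothetical.

```python
from collections import Counter

def related_keywords(records, main_keyword, robot_ips=()):
    """Count the keywords that occur together with main_keyword in exposed records."""
    counts = Counter()
    for record in records:
        if record["ip"] in robot_ips:
            continue  # skip records that were only 'seen' by search robots
        keywords = set(record["keywords"])
        if main_keyword in keywords:
            counts.update(keywords - {main_keyword})
    return counts

# Hypothetical records: publication id, IP address, time stamp and keywords.
sample_records = [
    {"ppn": "1", "ip": "10.0.0.1", "time": 1418079639,
     "keywords": ["Biological and chemical weapons", "Terrorism", "Disarmament"]},
    {"ppn": "2", "ip": "10.0.0.2", "time": 1418079700,
     "keywords": ["Biological and chemical weapons", "Terrorism"]},
    {"ppn": "3", "ip": "10.0.0.9", "time": 1418079800,
     "keywords": ["Biological and chemical weapons", "Arms control"]},
    {"ppn": "4", "ip": "10.0.0.1", "time": 1418079900,
     "keywords": ["Human rights", "Terrorism"]},
]

counts = related_keywords(
    sample_records, "Biological and chemical weapons", robot_ips={"10.0.0.9"}
)
for keyword, count in counts.most_common():
    print(keyword, count)
```

Sorting the counts from high to low gives the kind of ranking that can then be split over several diagrams, from the most used keywords to the least used.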

So what remains is to determine what kind of presentation to use to give a quick and thorough impression of specific keywords. For now I decided to use Tableau and to draw three diagrams with different colors. Each diagram shows the keyword descriptions and their relative amount of use, and each subsequent diagram presents keywords that are used less and less. So if you are looking for a publication about chemical warfare and genetic manipulation, after a peek at the diagrams below you will know which keywords to use to get to this information. (Curious? Look at this chapter: Terrorism in the Genomic Age / John Ellis, 2004.)

And let's not forget serendipity.

Thursday, 27 November 2014

Keywords. Collecting data.

In the last couple of weeks I blogged about keywords as they were displayed to the users of the OPAC in the library of the Peace Palace. I showed a couple of maps built with Gephi, some exhaustive, others very detailed.

But how did I collect and adapt the data to be used by Gephi? I already mentioned "keywords exposed to the user" in an earlier blog. So, to start with, what is the meaning of "exposed keywords"? By this I mean "keywords as they occur in the presentation of the titles which were actually seen, perhaps even read, by the user". These are the keywords I am interested in.

The enumeration of keywords in just one title can indeed be considered a very small network. All these keywords are somehow linked to one another. Therefore, the first step is to gather all the presented titles, the second step is to collect all these small networks of keywords and the last step is to create one huge file which can be used by Gephi.

The listings below give some examples of the file structure. In the first listing you see five keywords (for Gephi they are nodes), each with a count of one, called 'use'. Underneath that you see the unique combinations of the keywords (for Gephi they are edges), also with a count, called 'weight'. The number after the capital P is a unique keyword identifier. Gephi prefers doing arithmetic with simple codes instead of sometimes long strings with weird characters in them. The second listing shows the same, except that the keywords come from another title. In the third listing both sets of keywords are combined. Note the keyword 'Women': it occurs in both titles, so in the third listing its 'use' is raised to two. At the bottom of each listing you see the corresponding Gephi map.

nodedef>name VARCHAR,label VARCHAR, use INT
P076244229,"Middle East",1
P076242366,"Women",1
P076239810,"Islam",1
P076243265,"Family law",1
P07624234X,"Islamic law",1
edgedef>node1 VARCHAR,node2 VARCHAR,weight INT
P076244229,P076242366,1
P076244229,P076239810,1
P076244229,P076243265,1
P076244229,P07624234X,1
P076242366,P076239810,1
P076242366,P076243265,1
P076242366,P07624234X,1
P076239810,P076243265,1
P076239810,P07624234X,1
P076243265,P07624234X,1

nodedef>name VARCHAR,label VARCHAR, use INT
P076239519,"Refugees",1
P076242986,"Asylum",1
P076242366,"Women",1
P356258076,"Girls",1
P076256758,"Immigration",1
P241824206,"Convention relating to the Status of Refugees (Geneva, 28 July 1951)",1
P252290518,"Sex crimes",1
P255990391,"Gender",1
P332051005,"E-docs",1
edgedef>node1 VARCHAR,node2 VARCHAR,weight INT
P076239519,P076242986,1
P076239519,P076242366,1
P076239519,P356258076,1
P076239519,P076256758,1
P076239519,P241824206,1
P076239519,P252290518,1
P076239519,P255990391,1
P076239519,P332051005,1
P076242986,P076242366,1
P076242986,P356258076,1
P076242986,P076256758,1
P076242986,P241824206,1
P076242986,P252290518,1
P076242986,P255990391,1
P076242986,P332051005,1
P076242366,P356258076,1
P076242366,P076256758,1
P076242366,P241824206,1
P076242366,P252290518,1
P076242366,P255990391,1
P076242366,P332051005,1
P356258076,P076256758,1
P356258076,P241824206,1
P356258076,P252290518,1
P356258076,P255990391,1
P356258076,P332051005,1
P076256758,P241824206,1
P076256758,P252290518,1
P076256758,P255990391,1
P076256758,P332051005,1
P241824206,P252290518,1
P241824206,P255990391,1
P241824206,P332051005,1
P252290518,P255990391,1
P252290518,P332051005,1
P255990391,P332051005,1

nodedef>name VARCHAR,label VARCHAR, use INT
P076244229,"Middle East",1
P076242366,"Women",2
P076239810,"Islam",1
P076243265,"Family law",1
P07624234X,"Islamic law",1
P076239519,"Refugees",1
P076242986,"Asylum",1
P356258076,"Girls",1
P076256758,"Immigration",1
P241824206,"Convention relating to the Status of Refugees (Geneva, 28 July 1951)",1
P252290518,"Sex crimes",1
P255990391,"Gender",1
P332051005,"E-docs",1
edgedef>node1 VARCHAR,node2 VARCHAR,weight INT
P076244229,P076242366,1
P076244229,P076239810,1
P076244229,P076243265,1
P076244229,P07624234X,1
P076242366,P076239810,1
P076242366,P076243265,1
P076242366,P07624234X,1
P076239810,P076243265,1
P076239810,P07624234X,1
P076243265,P07624234X,1
P076239519,P076242986,1
P076239519,P076242366,1
P076239519,P356258076,1
P076239519,P076256758,1
P076239519,P241824206,1
P076239519,P252290518,1
P076239519,P255990391,1
P076239519,P332051005,1
P076242986,P076242366,1
P076242986,P356258076,1
P076242986,P076256758,1
P076242986,P241824206,1
P076242986,P252290518,1
P076242986,P255990391,1
P076242986,P332051005,1
P076242366,P356258076,1
P076242366,P076256758,1
P076242366,P241824206,1
P076242366,P252290518,1
P076242366,P255990391,1
P076242366,P332051005,1
P356258076,P076256758,1
P356258076,P241824206,1
P356258076,P252290518,1
P356258076,P255990391,1
P356258076,P332051005,1
P076256758,P241824206,1
P076256758,P252290518,1
P076256758,P255990391,1
P076256758,P332051005,1
P241824206,P252290518,1
P241824206,P255990391,1
P241824206,P332051005,1
P252290518,P255990391,1
P252290518,P332051005,1
P255990391,P332051005,1

Of course I do not create these lengthy files (15,000 lines and more) by hand. I wrote a couple of crude PHP scripts to generate a crude file. I clean up this file with R and Microsoft Excel, and the resulting file is ready to be used by Gephi. The scripts use a MongoDB collection which contains all the logging of OPAC use in our reading room. It is possible to detect 'exposed titles' (and so also the keywords therein) in this logging.
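The original scripts are in PHP and are not shown here. Purely as an illustration of the file structure described above, the following sketch merges the keyword lists of the two example titles into one node and edge listing. The function name `build_gdf` is hypothetical, and treating each keyword pair as an undirected edge is an assumption.

```python
from collections import Counter
from itertools import combinations

def build_gdf(titles):
    """Merge per-title keyword lists into one node/edge listing for Gephi."""
    labels = {}
    use = Counter()     # number of titles in which a keyword occurs
    weight = Counter()  # number of titles in which a keyword pair occurs
    for keywords in titles:
        unique = list(dict.fromkeys(keywords))  # drop duplicates, keep order
        for keyword_id, label in unique:
            labels.setdefault(keyword_id, label)
            use[keyword_id] += 1
        for (first, _), (second, _) in combinations(unique, 2):
            # Sort the pair so that A,B and B,A count as the same edge.
            weight[tuple(sorted((first, second)))] += 1
    lines = ["nodedef>name VARCHAR,label VARCHAR,use INT"]
    lines += [f'{kid},"{labels[kid]}",{use[kid]}' for kid in labels]
    lines.append("edgedef>node1 VARCHAR,node2 VARCHAR,weight INT")
    lines += [f"{a},{b},{count}" for (a, b), count in weight.items()]
    return "\n".join(lines)

# The two example titles from the listings above.
title_one = [
    ("P076244229", "Middle East"), ("P076242366", "Women"),
    ("P076239810", "Islam"), ("P076243265", "Family law"),
    ("P07624234X", "Islamic law"),
]
title_two = [
    ("P076239519", "Refugees"), ("P076242986", "Asylum"),
    ("P076242366", "Women"), ("P356258076", "Girls"),
    ("P076256758", "Immigration"),
    ("P241824206", "Convention relating to the Status of Refugees (Geneva, 28 July 1951)"),
    ("P252290518", "Sex crimes"), ("P255990391", "Gender"),
    ("P332051005", "E-docs"),
]

gdf_text = build_gdf([title_one, title_two])
print(gdf_text)
```

With these two titles the result has 13 nodes and 46 edges, and 'Women' is the only keyword with a use of two, which matches the combined listing.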

To conclude: this is all very technical stuff, and we cannot expect our users to do this kind of research themselves, based on raw data provided by the library. However, some library staff members should certainly be able to do this, and then communicate about the results, using interesting maps for instance. Communicate to management about library collection issues, communicate to users about trends, communicate about almost lost niches in the collection, and communicate about current, important subcollections which can be used in updating dossiers, research guides, alerting systems, etc.


Thursday, 20 November 2014

Just below the surface.

Using Gephi ("an interactive visualization and exploration platform for all kinds of networks") to create unprocessed maps of the keywords exposed to users in the library of the Peace Palace results in an image dominated by a few huge subjects. These subjects are indicators of the core business of the library: Human Rights, European Union, United States of America, International Law and International Criminal Law, to name just a few. To the left you see a very reduced image of such a map, but a few main keywords are still discernible.
These extra large topics veil the keywords just below them. Zooming in will eventually bring you to the overshadowed keywords, but only at a very deep level, so you lose the overview of the structure. To the left we have zoomed in on an area clearly dominated by 'Human rights'. Now if I remove 'Human rights' from this cluster, Gephi will recalculate a lot of values, because one predominant element has been removed. After all, all keywords constitute one network. So the map gets a new shape, especially if all of the above-mentioned subjects are removed, and that is exactly what I have done. All the veiled keywords float to the surface.
Let us now choose another criterion to create a map, instead of the number of times a keyword occurs. Gephi gives us a few other options, one of which is betweenness centrality.

Et voilà, after using the option 'rank parameter' in Gephi and choosing betweenness centrality, a new overall map appears, now with different highlighted nodes or keywords. Before zooming in, I will try to explain what betweenness centrality is. In brief, betweenness centrality is an indicator value for a key position. The higher the value, the more important the role of the keyword. This value is calculated by counting the shortest paths between every two keywords in our network. The keyword which appears most often in between two different keywords has the highest betweenness centrality value; these keywords are brokers or intermediaries. I used these values to create the map at the left.
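To make the counting of shortest paths concrete, here is a small sketch for an unweighted, undirected network. It uses Brandes' algorithm, a standard way to compute this value; it is not taken from Gephi's source, and Gephi's own output may be scaled differently. The function name and the five-keyword network are hypothetical.

```python
from collections import deque

def betweenness_centrality(adjacency):
    """Betweenness per node of an unweighted, undirected graph (Brandes' algorithm)."""
    centrality = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search from source, counting shortest paths to every node.
        visit_order = []
        predecessors = {node: [] for node in adjacency}
        path_count = {node: 0 for node in adjacency}
        distance = {node: -1 for node in adjacency}
        path_count[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            current = queue.popleft()
            visit_order.append(current)
            for neighbour in adjacency[current]:
                if distance[neighbour] < 0:
                    distance[neighbour] = distance[current] + 1
                    queue.append(neighbour)
                if distance[neighbour] == distance[current] + 1:
                    path_count[neighbour] += path_count[current]
                    predecessors[neighbour].append(current)
        # Walk back from the farthest nodes, sharing out the path dependencies.
        dependency = {node: 0.0 for node in adjacency}
        while visit_order:
            node = visit_order.pop()
            for pred in predecessors[node]:
                share = path_count[pred] / path_count[node]
                dependency[pred] += share * (1 + dependency[node])
            if node != source:
                centrality[node] += dependency[node]
    # Each pair of nodes was counted from both ends, so halve the totals.
    return {node: value / 2 for node, value in centrality.items()}

# Hypothetical network: 'Women' links two otherwise separate groups of keywords.
network = {
    "Women": ["Family law", "Islam", "Refugees", "Asylum"],
    "Family law": ["Women", "Islam"],
    "Islam": ["Women", "Family law"],
    "Refugees": ["Women", "Asylum"],
    "Asylum": ["Women", "Refugees"],
}
scores = betweenness_centrality(network)
print(scores)
```

In this small example every shortest path between the two groups runs through 'Women', so it is the only keyword with a non-zero value: a broker in the sense described above.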

After zooming in a little on the section of the map where the overall keyword 'Human Rights' used to be, a new picture arises. We see keywords like Children, Women and Family law, all of course related to Human rights, and quite a few of them with a high key position or broker value. In short, a new picture of related subjects emerges, indicating what the library of the Peace Palace could provide to its users.

By the way, the relations of keywords with a high betweenness centrality are not restricted to just one general subject. Keywords of this kind can have dense relations to other general subjects as well; see the image to the left.

Gephi maps not only give students and scholars a tool to explore the collections of libraries, they are also a clear reminder of the necessity of using keywords to conduct efficient bibliographic research.


Those of you who would like to have the data file used in Gephi to create all the maps shown, do contact me at a.janson at ppl dot nl.
