Thursday 11 December 2014

Country profiles and OPAC use.

The library of the Peace Palace serves a global community. I know this because I can see it in the standard logging of our website. Every request for one of our web pages ends up in a log file, and every line in this log file contains the IP address of the visitor who requested that page. This IP address can be translated to a country of origin.

And that is what I did in my blog "MOOC: learning and instruction: Tableau and library use". In that blog I presented several maps, one of them dealing with the use of our 'human rights' web pages. The map is to the left (sorry about Alaska). One might say that the use of these pages is at least partly driven by people searching the Internet, and indeed that is usually the case. Only a handful of people will have added the library of the Peace Palace to their bookmarks.

Of course it is possible to follow users as they move around our website, but in general, and in truth, they mostly leave short tracks. I even think these tracks are too short to draw any substantiated conclusions about what exactly our website users are looking for. It is a lot easier to rise above the personal level and pay more attention to the geographical level. That means creating world maps, just to start with.

Besides a website, libraries also provide an OPAC, normally with a web search interface to their collections. And our OPAC server, you guessed it, produces log files. These log files look quite different from the log files of regular web servers. Since the beginning of this week we have been collecting these files (thanks to OCLC, The Netherlands) in order to parse them, store the relevant data in a database and draw some conclusions. As I said, these files are a bit more complicated than the web server log files I usually look at, but I am sure that in the next couple of months I will be able to deal with them. Just a random example of one log entry:

#XXX.XXX.XXX.XX 60765 1418079639.497074 GET /DB=1/SET=2/TTL=1/CMD?ACT=SRCHA&IKT=4&SRT=YOP&TRM=population HTTP/1.1
Host: catalogue.ppl.nl
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36
Referer: http://catalogue.ppl.nl/DB=1/SET=2/TTL=1/NXT?FRST=16
Accept-Encoding: gzip, deflate, sdch
Accept-Language: fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4
Cookie: DB="1"; PSC_1="CMD_COLLAPSE%07N%07%08FKT%074%07%08FRM%07population%07%08IMPLAND%07Y%07%08LNG%07EN%07%08LRSET%072%07%08REFERER%07%07%08SET%072%07%08SID%07b6cadb8a-0%07%08SRT%07YOP%07%08TTL%071%07%08XSLBASE%07http%3A%2F%2Flbs-vrep.oclc.org%3A8282%2Foclc_gui%07%08XSLFILE%07%253Fid%253D$c%2526db%253D$d%07%08"

Where you see all the X's is the IP address, which I replaced for obvious reasons. In the request line I see 'IKT', which indicates the index used for searching, and 'TRM' with the actual search term, 'population'. I also see that our user has retrieved another set of results before (SET=2, so there must be a previous SET=1). In the Cookie header I also notice the session identifier (b6cadb8a-0), which I can use to reconstruct all the actions of our user. In summary, there are a lot of possibilities to collect useful information: information the library can use to provide its users with optimal, up-to-date information and services such as instruction.
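As an illustration, the fields mentioned above can be pulled out of a log entry with a few lines of Python. This is only a minimal sketch, not the parser actually used in the library: the function name `parse_opac_request` is hypothetical, the request-line layout is assumed to be exactly that of the sample entry, and the IP address 192.0.2.10 is a documentation placeholder.

```python
import re
from urllib.parse import parse_qs, unquote, urlsplit


def parse_opac_request(request_line, cookie_header=""):
    """Extract search details from one OPAC log entry (hypothetical helper)."""
    # Assumed layout, as in the sample: "#<ip> <port> <epoch> GET <url> HTTP/1.1"
    ip, _port, epoch, _method, url = request_line.lstrip("#").split()[:5]
    parts = urlsplit(url)
    # Path segments such as DB=1, SET=2 and TTL=1 become a small dictionary.
    path_fields = dict(
        segment.split("=", 1) for segment in parts.path.split("/") if "=" in segment
    )
    query = parse_qs(parts.query)
    # In the cookie the session id appears as SID%07<value>%07;
    # after unquoting, %07 is the control character 0x07.
    match = re.search(r"SID\x07([^\x07]*)", unquote(cookie_header))
    return {
        "ip": ip,
        "timestamp": float(epoch),
        "set": path_fields.get("SET"),
        "index": query.get("IKT", [None])[0],
        "term": query.get("TRM", [None])[0],
        "session": match.group(1) if match else None,
    }


sample = ("#192.0.2.10 60765 1418079639.497074 GET "
          "/DB=1/SET=2/TTL=1/CMD?ACT=SRCHA&IKT=4&SRT=YOP&TRM=population HTTP/1.1")
cookie = 'DB="1"; PSC_1="SET%072%07%08SID%07b6cadb8a-0%07%08SRT%07YOP%07%08"'
entry = parse_opac_request(sample, cookie)
print(entry["index"], entry["term"], entry["set"], entry["session"])
# prints: 4 population 2 b6cadb8a-0
```

Grouping such entries by their session value would then give the sequence of actions of one user.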

To end this blog I will show two maps (I used Tableau) containing counts of successful searches in our OPAC during just two days, 8 and 9 December. One map shows the use of our OPAC in Europe and, not surprisingly, The Netherlands scores highest.


In the other map I had to leave out The Netherlands to create sufficient distinction in colour. This map contains the same data as the map above. 


Wednesday 3 December 2014

Presenting keywords. Eh? Which keywords? And how?

It is always hard to get started with bibliographic research, especially for those patrons and scholars who realize it is important not just to search on a few words in a very general index, but to use keywords (sometimes also called subject headings). These users know that a very dedicated group of librarians has thoroughly examined the publications added to the OPAC and has enriched them with keywords. And therein lies a problem, because one might ask: "what do these keywords look like?".

In my last couple of blogs I focused on how to get an idea of subject areas (huge and small). I used Gephi to create maps which could indeed give an impression of subjects and subject areas. For technical reasons, however, these maps dealt with only a relatively small set of data. In the most recently published map I incorporated only about 4,500 titles exposed to our users in the reading room of the library. This may look like a lot, but in fact it is not. Smaller subjects may therefore not appear during this specific time frame, or may go unnoticed. To get a more thorough impression of the subjects and the keywords used in the library one should use as large a set of information as possible. Luckily I have collected such a set, but before I come to that, I would like to sketch a situation.

So, suppose a patron is looking for publications about biological warfare and the Security Council (we are, after all, the Peace Palace Library). Chances are that he uses the all words index in our OPAC (about 70% of our users do so) and types 'biological warfare security council'. No hits. He then tries 'biological warfare' using the same index: 92 hits, which he quickly scans for what interests him (about 6 pages of title information). He then tries 'security council': 1,416 hits, which he does not scan, because it is too much. Now suppose he suddenly realizes he should also use the keyword index. Biological warfare: 213 hits. Security council: 2,741 hits. Combined (we just assume that he knows how to do this): 1 hit (a freely available PDF file containing references to other freely available texts and links to websites).

It is my opinion that libraries should bring to the forefront sets of keywords, each set related to one general subject. Library users then only have to check these overviews to understand which keywords they should use in their research. To build such a set it is important to collect just the keywords which were exposed to or used by our patrons, preferably over a longer time span. This way we can be confident that the relevant keywords are collected.

In January of this year I started to collect all the records which were, in one way or another, seen or used by our patrons and visitors. This includes the records 'seen' by search robots, like the ones from Google. The database (we use MongoDB as the database system) now contains almost 7,500,000 records, but keep in mind that less than 10% of this number can actually be attributed to human beings. Each record contains the publication id, IP address, time stamp and the keywords belonging to the publication. Given the size of the database it is possible to collect the keywords actually used or exposed in relation to just one general subject.

I managed to create such a set of keywords as an example. They were all used in some combination with my 'main' keyword Biological and chemical weapons. The set contains a little more than 410 different keywords, some used quite a lot, others just a few times.
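How such a set can be derived from the stored records is easy to sketch in Python. The snippet below is only an illustration under assumptions: the field names `publication_id` and `keywords` and the three sample records are invented, and in practice the records would come from the MongoDB collection rather than from a list.

```python
from collections import Counter


def cooccurring_keywords(records, main_keyword):
    """Count the keywords that appear together with main_keyword."""
    counts = Counter()
    for record in records:
        keywords = set(record["keywords"])
        if main_keyword in keywords:
            # Count every other keyword of this title once.
            counts.update(keywords - {main_keyword})
    return counts


# Invented sample records, for illustration only.
records = [
    {"publication_id": "A",
     "keywords": ["Biological and chemical weapons", "Terrorism", "Genetic engineering"]},
    {"publication_id": "B",
     "keywords": ["Biological and chemical weapons", "Terrorism"]},
    {"publication_id": "C",
     "keywords": ["Security Council", "Terrorism"]},
]

related = cooccurring_keywords(records, "Biological and chemical weapons")
for keyword, count in related.most_common():
    print(count, keyword)
```

Sorting by count, as `most_common()` does, separates the keywords used quite a lot from those used only a few times.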

So what remains is to determine what kind of presentation gives a quick and thorough impression of specific keywords. I decided for now to use Tableau and to draw three diagrams in different colors. Each diagram shows the keyword descriptions and their relative amount of use; each subsequent diagram presents keywords that were used less and less. So if you are looking for a publication about chemical warfare and genetic manipulation, after a peek at the diagrams below you will know which keywords to use to get to this information. (Curious? Look at this chapter: Terrorism in the Genomic Age / John Ellis, 2004.)

And let's not forget serendipity.





Thursday 27 November 2014

Keywords. Collecting data.

In the last couple of weeks I blogged about keywords as they were displayed to the users of the OPAC in the library of the Peace Palace. I showed a couple of maps built with Gephi, some exhaustive, others very detailed.

But how did I collect and adapt the data to be used by Gephi? I already mentioned "exposed keywords to the user" in an earlier blog. So, to start with: what is the meaning of "exposed keywords"? By this I mean "keywords as they occur in the presentation of the titles which were actually seen, perhaps even read, by the user". I am interested in these keywords.

The enumeration of keywords in just one title can be considered a very small network: all these keywords are somehow linked to one another. Therefore, the first step is to gather all the presented titles, the second step is to collect all these small networks of keywords and the last step is to create one huge file which can be used by Gephi.

In the three listings below I give some examples of the file structure. In the first listing you see five keywords (for Gephi they are nodes), each with a count of one, called 'use'. Underneath you see the unique combinations of these keywords (for Gephi they are edges), also with a count, called 'weight'. The code starting with a capital P is a unique keyword identifier. Gephi prefers working with simple codes rather than with sometimes long strings containing odd characters. The second listing shows the same, except that the keywords are from another title. In the third listing both sets of keywords are combined. Take note of the keyword 'Women': it occurs in both titles, so in the third listing its 'use' is raised to two. Each listing is accompanied by the corresponding Gephi map.

nodedef>name VARCHAR,label VARCHAR, use INT
P076244229,"Middle East",1
P076242366,"Women",1
P076239810,"Islam",1
P076243265,"Family law",1
P07624234X,"Islamic law",1
edgedef>node1 VARCHAR,node2 VARCHAR,weight INT
P076244229,P076242366,1
P076244229,P076239810,1
P076244229,P076243265,1
P076244229,P07624234X,1
P076242366,P076239810,1
P076242366,P076243265,1
P076242366,P07624234X,1
P076239810,P076243265,1
P076239810,P07624234X,1
P076243265,P07624234X,1

nodedef>name VARCHAR,label VARCHAR, use INT
P076239519,"Refugees",1
P076242986,"Asylum",1
P076242366,"Women",1
P356258076,"Girls",1
P076256758,"Immigration",1
P241824206,"Convention relating to the Status of Refugees (Geneva, 28 July 1951)",1
P252290518,"Sex crimes",1
P255990391,"Gender",1
P332051005,"E-docs",1
edgedef>node1 VARCHAR,node2 VARCHAR,weight INT
P076239519,P076242986,1
P076239519,P076242366,1
P076239519,P356258076,1
P076239519,P076256758,1
P076239519,P241824206,1
P076239519,P252290518,1
P076239519,P255990391,1
P076239519,P332051005,1
P076242986,P076242366,1
P076242986,P356258076,1
P076242986,P076256758,1
P076242986,P241824206,1
P076242986,P252290518,1
P076242986,P255990391,1
P076242986,P332051005,1
P076242366,P356258076,1
P076242366,P076256758,1
P076242366,P241824206,1
P076242366,P252290518,1
P076242366,P255990391,1
P076242366,P332051005,1
P356258076,P076256758,1
P356258076,P241824206,1
P356258076,P252290518,1
P356258076,P255990391,1
P356258076,P332051005,1
P076256758,P241824206,1
P076256758,P252290518,1
P076256758,P255990391,1
P076256758,P332051005,1
P241824206,P252290518,1
P241824206,P255990391,1
P241824206,P332051005,1
P252290518,P255990391,1
P252290518,P332051005,1
P255990391,P332051005,1

nodedef>name VARCHAR,label VARCHAR, use INT
P076244229,"Middle East",1
P076242366,"Women",2
P076239810,"Islam",1
P076243265,"Family law",1
P07624234X,"Islamic law",1
P076239519,"Refugees",1
P076242986,"Asylum",1
P356258076,"Girls",1
P076256758,"Immigration",1
P241824206,"Convention relating to the Status of Refugees (Geneva, 28 July 1951)",1
P252290518,"Sex crimes",1
P255990391,"Gender",1
P332051005,"E-docs",1
edgedef>node1 VARCHAR,node2 VARCHAR,weight INT
P076244229,P076242366,1
P076244229,P076239810,1
P076244229,P076243265,1
P076244229,P07624234X,1
P076242366,P076239810,1
P076242366,P076243265,1
P076242366,P07624234X,1
P076239810,P076243265,1
P076239810,P07624234X,1
P076243265,P07624234X,1
P076239519,P076242986,1
P076239519,P076242366,1
P076239519,P356258076,1
P076239519,P076256758,1
P076239519,P241824206,1
P076239519,P252290518,1
P076239519,P255990391,1
P076239519,P332051005,1
P076242986,P076242366,1
P076242986,P356258076,1
P076242986,P076256758,1
P076242986,P241824206,1
P076242986,P252290518,1
P076242986,P255990391,1
P076242986,P332051005,1
P076242366,P356258076,1
P076242366,P076256758,1
P076242366,P241824206,1
P076242366,P252290518,1
P076242366,P255990391,1
P076242366,P332051005,1
P356258076,P076256758,1
P356258076,P241824206,1
P356258076,P252290518,1
P356258076,P255990391,1
P356258076,P332051005,1
P076256758,P241824206,1
P076256758,P252290518,1
P076256758,P255990391,1
P076256758,P332051005,1
P241824206,P252290518,1
P241824206,P255990391,1
P241824206,P332051005,1
P252290518,P255990391,1
P252290518,P332051005,1
P255990391,P332051005,1





Of course I do not create these lengthy files (15,000 lines and more) by hand. I wrote a couple of crude PHP scripts to generate a rough file. I clean up this file with R and Microsoft Excel, and the resulting file is ready to be used by Gephi. The scripts use a MongoDB collection which contains all the logging of OPAC use in our reading room. In this logging it is possible to detect 'exposed titles' (and therefore also the keywords in them).
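For readers who want to try something similar, here is a minimal Python sketch of the counting step. It is not the PHP script mentioned above, only an illustration of the same idea: the function name `build_gdf` is hypothetical, the titles are passed in as plain lists of keyword identifiers instead of being read from MongoDB, and labels are assumed not to contain double quotes.

```python
from collections import Counter
from itertools import combinations


def build_gdf(titles, labels):
    """Build GDF text from titles (lists of keyword ids) and an id -> label map."""
    use = Counter()     # how often each keyword occurs (node attribute 'use')
    weight = Counter()  # how often each keyword pair occurs (edge 'weight')
    for keyword_ids in titles:
        unique = list(dict.fromkeys(keyword_ids))  # drop duplicates, keep order
        use.update(unique)
        for a, b in combinations(unique, 2):
            # Edges are undirected: reuse the orientation seen first.
            pair = (a, b) if (b, a) not in weight else (b, a)
            weight[pair] += 1
    lines = ["nodedef>name VARCHAR,label VARCHAR, use INT"]
    lines += [f'{key},"{labels[key]}",{count}' for key, count in use.items()]
    lines.append("edgedef>node1 VARCHAR,node2 VARCHAR,weight INT")
    lines += [f"{a},{b},{count}" for (a, b), count in weight.items()]
    return "\n".join(lines)


labels = {
    "P076244229": "Middle East", "P076242366": "Women", "P076239810": "Islam",
    "P076239519": "Refugees", "P076242986": "Asylum",
}
titles = [
    ["P076244229", "P076242366", "P076239810"],  # part of the first title
    ["P076239519", "P076242986", "P076242366"],  # part of the second title
]
gdf = build_gdf(titles, labels)
print(gdf)
```

Because 'Women' occurs in both shortened titles, its 'use' comes out as 2, just as in the combined listing above.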

To conclude: this is all very technical stuff, and we cannot expect our users to do this kind of research themselves, based on raw data provided by the library. However, some library staff members should certainly be able to do this, and then communicate about the results, using interesting maps for instance: communicate to management about library collection issues, to users about trends, about almost lost niches in the collection, and about current, important subcollections which can be used in updating dossiers, research guides, alerting systems, etc.

My other blogs about 'Gephi in libraries':

Thursday 20 November 2014

Just below the surface.

Using Gephi ("an interactive visualization and exploration platform for all kinds of networks") to create unprocessed maps of the keywords exposed to the user in the library of the Peace Palace results in an image dominated by a few huge subjects. These subjects are indicators of the core business of the library: Human Rights, European Union, United States of America, International Law and International Criminal Law, to name just a few. To the left you see a very reduced image of such a map, but a few main keywords are still discernible.
These extra large topics veil the keywords just below them. Zooming in will eventually bring you to the overshadowed keywords, but only at a very deep level, so you lose the overview of the structure. To the left we have zoomed in on an area clearly dominated by 'Human rights'. Now if I remove 'Human rights' from this cluster, Gephi will recalculate a lot of values, because one predominant element has been removed. After all, all keywords constitute one network. So the map gets a new shape, especially if all of the above-mentioned subjects are removed, and that is exactly what I have done. All the veiled keywords float to the surface.
Let us now choose another criterion to create a map, instead of the number of times a keyword occurs. Gephi gives us a few other options; one of them is betweenness centrality.

Et voilà: after using the option 'rank parameter' in Gephi and choosing betweenness centrality, a new overall map appears, now with newly highlighted nodes or keywords. Before zooming in, I will try to explain what betweenness centrality is. In brief, betweenness centrality is an indicator of a key position: the higher the value, the more important the role of the keyword. The value is calculated by looking at the shortest paths between every pair of keywords in our network and counting how often a given keyword lies on such a path. The keyword which most often sits in between two other keywords has the highest betweenness centrality value; such keywords are brokers or intermediaries. I used these values to create the map at the left.
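To make the idea of a 'broker' keyword more concrete, here is a small Python sketch that computes betweenness centrality for an unweighted, undirected network using Brandes' algorithm. It is meant as an illustration only: the tiny keyword network is invented, and the values Gephi reports may be scaled or normalized differently.

```python
from collections import deque


def betweenness_centrality(graph):
    """Brandes' algorithm; graph maps each node to the set of its neighbours."""
    score = {node: 0.0 for node in graph}
    for source in graph:
        # Breadth-first search from source, counting shortest paths.
        order = []
        predecessors = {node: [] for node in graph}
        paths = {node: 0 for node in graph}
        paths[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbour in graph[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
                if distance[neighbour] == distance[node] + 1:
                    paths[neighbour] += paths[node]
                    predecessors[neighbour].append(node)
        # Walk back from the farthest nodes, accumulating path shares.
        dependency = {node: 0.0 for node in graph}
        for node in reversed(order):
            for pred in predecessors[node]:
                dependency[pred] += paths[pred] / paths[node] * (1 + dependency[node])
            if node != source:
                score[node] += dependency[node]
    # Every undirected pair was counted once from each end.
    return {node: value / 2 for node, value in score.items()}


# Invented example: two keyword clusters that only meet in 'Women'.
graph = {
    "Middle East": {"Islam", "Women"},
    "Islam": {"Middle East", "Women"},
    "Women": {"Middle East", "Islam", "Refugees", "Asylum"},
    "Refugees": {"Women", "Asylum"},
    "Asylum": {"Women", "Refugees"},
}
centrality = betweenness_centrality(graph)
print(centrality["Women"])  # prints 4.0
```

In this example every shortest path between the two clusters runs through 'Women' (four pairs of keywords), so it is the only keyword with a value above zero.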

After zooming in a little on the section of the map where the overall keyword 'Human Rights' used to be, a new picture arises. We see keywords like Children, Women and Family law, all of course related to Human rights and quite a few of them with a high key position or broker value. In short a new picture of related subjects emerges, indicating what the library of the Peace Palace could provide to its users.

By the way, the relations of keywords with a high betweenness centrality are not restricted to just one general subject. Keywords of this kind may have dense relations to other general subjects as well; see the image to the left.

Using Gephi maps not only gives students and scholars a tool to explore the collections of libraries, it is also a clear reminder of the necessity of using keywords to conduct efficient bibliographic research.


Those of you who would like to have the data file used in Gephi to create all the maps shown, do contact me at a.janson at ppl dot nl.


Thursday 13 November 2014

Keywords! Maps! Let's dive in.


Last week I blogged about maps and keywords: Library and user: one interest? I presented a few maps, created with Gephi, with which I tried to compare the activities of the library staff with the interests of OPAC users. I talked about general subjects like 'international criminal law', 'space debris' and things like that.

These maps can also be used to get a detailed picture, although I admit the presented maps are a bit difficult to read after zooming in. However, librarians can use Gephi itself to do detailed research in order to find out what our patrons are looking for.
See for example this image, clipped from the Gephi overview graph frame, which shows keywords about art, trade and illegal activities in just a tiny section of the map. I think librarians can use such insights to serve their users better, especially if they detect recurring patterns in searches over a longer period.

If librarians can 'translate' these insights into more relevant acquisitions, improvements to their research guides (in this case the Peace Palace Library, Cultural Heritage) or specific blogs or tweets, I am sure interested visitors will return to the library.

Of course users can manipulate the map with OPAC searches and focus on just one group (to the left you see the International Criminal Law group), but even one large group can be quite intimidating. Nevertheless those users who take some time can obtain a thorough knowledge about keywords grouped around one or two core subjects of the Peace Palace Library. Just start selecting a group using 'Group Selector' then click the largest bubble and check all the other keywords in the 'Information Pane'.

Librarians may use some of the more specific possibilities Gephi offers to look at maps in a very specific way. They may use for instance "Betweenness Centrality numbers" to look at 'broker' keywords, thus getting an idea about intermediaries. This knowledge too, I repeat, can be used to better respond to the needs of library users. I will write about this another time.

Thursday 6 November 2014

Library and user: one interest?

Quote: "But also interesting is, to see whether the library staff takes the interests of the patrons into account while acquiring documents for their collection? That is a subject for another blog."

Here I am referring to an earlier blogpost in which I tried to show what our users are looking for in the OPAC of the Peace Palace Library. In order to make this happen I focused on the use of our link resolver and presentation of a general subject in this link resolver. I used Tableau to create some graphs. 


However, the same thing can be done on the basis of the title descriptions which appeared on the screens in the reading room of the library after a successful search. So I collected all these titles and used all the keywords added to these titles to create a map using Gephi. In yet another blog I reported on this, although there I used the recent acquisitions of the month of September.


To answer the question "whether the library staff takes the interests of the patrons into account while acquiring documents for their collection?" I created two maps for comparison: one about the acquisitions in October and the other about the use of the OPAC in the reading room in the same month.



Acquisitions OPAC

If I enumerate the main subject topics which can be identified on indicated webpages, we get the following lists:


Acq:
  • International criminal law
  • Human rights
  • European Union
  • International trade
  • History
  • United Nations
  • Private international law
  • Islam/Islamic law
  • Law of the sea
  • Immigration
OPAC:
  • International criminal law*
  • Human rights*
  • European Union*
  • United Nations*
  • International humanitarian law
  • International commercial arbitration
  • History/Politics*
  • Environmental protection
  • Law of the sea*
  • Space law

So the behavior of our users indicates a special interest in, among other things, Space, Environment and Commerce, subjects which were not covered by the acquisitions of our library staff. Conversely, the library acquired material about Immigration, Islamic Law and Trade which was not looked for by our OPAC users. Most important, however, is the observation that both parties share an interest in the core business of the library of the Peace Palace: Criminal law, Human rights, European Union.
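The comparison of the two lists can also be written down as a few set operations in Python. This is just a sketch of the reasoning above, with one assumption of mine: combined labels such as 'History/Politics' and 'Islam/Islamic law' are reduced to the part before the slash, so that 'History' counts as a shared topic.

```python
acquisitions = [
    "International criminal law", "Human rights", "European Union",
    "International trade", "History", "United Nations",
    "Private international law", "Islam/Islamic law", "Law of the sea",
    "Immigration",
]
opac = [
    "International criminal law", "Human rights", "European Union",
    "United Nations", "International humanitarian law",
    "International commercial arbitration", "History/Politics",
    "Environmental protection", "Law of the sea", "Space law",
]


def main_topic(label):
    """Keep only the part before a slash (an assumption, see above)."""
    return label.split("/")[0].strip()


acquired = {main_topic(label) for label in acquisitions}
searched = {main_topic(label) for label in opac}

shared = sorted(acquired & searched)         # topics in both lists
only_acquired = sorted(acquired - searched)  # acquired, but not searched for
only_searched = sorted(searched - acquired)  # searched for, but not acquired

print("Shared:", shared)
print("Only acquired:", only_acquired)
print("Only searched:", only_searched)
```

An exact comparison like this is stricter than reading the maps by eye: 'International trade' and 'International commercial arbitration', for example, end up in different lists although both concern commerce.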

Differences exist only in the peripheral areas, and these can largely be related to current events, like boat refugees in the Mediterranean Sea, terrorism in the Middle East, space debris and environmental issues.

Anyway, the simple fact that the 'small subjects' are also found and acquired, means that the library of the Peace Palace is on the right track. The 'small subjects' looked for now, were added in the past!


Wednesday 29 October 2014

Subjects by keywords.

Each library buys books and journals (or access to their electronic versions), files, databases, etc. Each library has a specific focus on a particular field of interest or, more often, on multiple fields of interest. In the library of the Peace Palace 'documents' are acquired in the field of international law. Of course it is possible to recognize a large number of subfields within this vast subject: international criminal law, human rights, diplomacy and so on.

For the users of the library, it is important to know which of these areas are covered and therefore whether it is worthwhile to use the library when you yourself are dealing with such a subject. One of the tools the library uses is a system of regular 'alerts'. Interested scholars, students and other parties can be informed about the most recent acquisitions once a week. There are around 1,000 subscribers to this service. They receive a weekly overview, based on a single criterion. The consequence of using a single general criterion is of course that the outcome may be huge. Topics like 'European Union' or 'Public International Law' in particular may contain quite a lot of bibliographical references.

But what if you have a way of presenting the acquisitions of one month, not on the basis of one single general criterion, but in a way where combinations of the keywords assigned to each title play a role? What if relationships between those keywords can be visualized on the website instead of emailed to a subscriber? In an attempt to make that possible I created a clickable map, which can be found at http://www.ppl.nl/september.

To create this map I used the visualization tool 'Gephi', which is especially strong in showing the links between the building blocks of a map. For this purpose I consider the keywords to be the building blocks. At the above-mentioned web address the acquisitions of the month of September are recorded, not in the form of boring title lists, but in the form of assigned keywords and the relationships between those keywords. It is still all about numbers, as the strength of a relation is determined by the number of occurrences of a keyword pair.

Of course, in a batch of thousands of titles some areas of the map will show up as large blobs of tightly connected keywords. These blobs refer to the core businesses of the library. At the edges of the map, smaller subareas appear. In other words, large areas of the map show acquisitions that are always extremely important to the library. They can be recognized because the largest circles in the map appear there, surrounded by a large number of closely packed smaller circles.


Subtopics are located outside the center of the map and can be considered as subjects which are farther away from the core business. So the area in the upper right is characterized by the keyword 'History', especially the history of the First World War. Typical related keywords are: 'Military History', 'Massacres', 'Ethnic Minorities', 'Russian Empire', etc. In the bottom left there is an area that relates to commerce, trade, international commercial arbitration, etc.

It must be stressed that subareas differ each month. It all depends on what is going on in the world. Highlighted in the last couple of months of this year is of course the commemoration of the start of the First World War. But other current events can also lead to a temporary increase in attention, such as transboundary pollution of the environment, disease outbreak, cyber warfare, sporting events, etc.

Clicking on a circle produces an 'Information Pane' where additional information can be found on that keyword, like how many times it occurs, but also the related keywords -those that are used in combination with the clicked keyword- are mentioned. So, with a few clicks it is easy to get a good impression about different topics and topic areas.

Finally, you are invited to use the map by hovering over the various items and view the information in the right pane. Are you looking for a specific topic? You can search by using the Search box in the left panel or zoom in on a specific group or cluster by 'Group Selector'.

Wednesday 22 October 2014

MOOC: learning and instruction: Tableau and library use.

On October 20, 2014 a new MOOC, "Data, Analytics and Learning" (hashtag #dalmooc), started. I am taking part in this course. In addition to a new approach to the process of learning itself, there is also the usual presentation of course material, with video, text, references to relevant literature and assignments.

One of the first tools we were asked to study is Tableau, a tool to examine, analyze and present data in a visually appealing and very informative manner. Luckily, participants in the course get a code which can be used to install the full desktop version of the software, at least for the duration of the course (until January 2015). To be clear, Tableau is not free software.

After Googling around looking for instructions, manuals and the like, I realized there were no example files shown or used in Tableau instruction videos, which could be related to libraries, or more specifically, to OPACs used in libraries. As I work in the library of the Peace Palace I thought to collect some library data and use it in Tableau, just as an exercise.

Every time our link resolver is used, some data is stored in a database. We use a MongoDB database for this purpose. At the time of writing we have collected a little more than half a million of these documents. We store, among other things, a time stamp, the country of the user based on the IP address, general subject information and short bibliographic information. Although I know that it is possible to connect Tableau to a MongoDB server using a special ODBC connector, I will still use an Excel file in Tableau, to keep things simple, and generate some equally simple graphics.

The file contains just the country, the general subject in coded format (i.e. a number) and the day number, and is limited to link resolver use in 2014. Even so we still have some 370,000 rows!

We see that the most popular subject, not surprisingly, is 'European Union (42)', the second most popular subject is 'Human Rights (60)', and not so popular is 'Mutual Cooperation in Criminal Matters (147)'.

Let us focus on 'Human Rights' and see which countries are the most interested in this subject, based on the number of clicks per country.
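Outside Tableau, the same count of clicks per country for one subject can be sketched in a few lines of Python. This is only an illustration: the rows below are invented, and the tuple layout (country, subject code, day number) is my assumption based on the columns of the file described above. Subject code 60 stands for 'Human Rights'.

```python
from collections import Counter


def clicks_per_country(rows, subject_code):
    """Count link resolver clicks per country for one general subject."""
    return Counter(
        country for country, subject, _day in rows if subject == subject_code
    )


# Invented sample rows: (country, subject code, day number).
rows = [
    ("Netherlands", 60, 294),
    ("Germany", 60, 294),
    ("Netherlands", 60, 295),
    ("India", 42, 295),
    ("Netherlands", 42, 296),
]

human_rights = clicks_per_country(rows, 60)
for country, clicks in human_rights.most_common():
    print(country, clicks)
```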

For contrast, here is the subject 'History'. Please remember that the sizes of the dots are relative to the subject; they do not represent actual numbers:

This is all very interesting, because we have here an indication of what our users look for and where they come from. But it is also interesting to see whether the library staff takes the interests of the patrons into account while acquiring documents for their collection. That is a subject for another blog.

To conclude. With Tableau it is easy to understand what is important or interesting to students and scholars using our library OPAC and / or link resolver. And I just scratched the surface of this software....

Tuesday 21 October 2014