Thoughts about ...: MongoDB

Showing posts with the label MongoDB.

Wednesday, 3 December 2014

Presenting keywords. Eh? Which keywords? And how?

It is always hard to get started with bibliographic research, especially for those patrons and scholars who realize it is important not just to search for a few words in a very general index, but to use keywords (sometimes also called subject headings). These users know that a very dedicated group of librarians has thoroughly examined the publications added to their OPAC and has enriched them with keywords. And therein lies a problem, because one might ask: "What do these keywords look like?"

In my last couple of blog posts I focused on how to get an idea of subject areas (large and small). I used Gephi to create maps which could indeed give an impression of subjects and subject areas. For technical reasons, however, these maps dealt with only a relatively small set of data. In the most recently published map I incorporated only about 4,500 titles exposed to our users in the reading room of the library. This may seem like a lot, but in fact it is not. Smaller subjects may therefore not appear during this specific time frame, or may go unnoticed. To get a more thorough impression of the subjects and the keywords used in the library, one should use as large a set of information as possible. Luckily I have collected such a set, but before I come to that, I would like to sketch a situation.

So, suppose a patron is looking for publications about biological warfare and the Security Council (we are, after all, the Peace Palace Library). Chances are he uses the all-words index in our OPAC (about 70% of our users do so) and types 'biological warfare security council'. No hits. He then tries 'biological warfare' using the same index: 92 hits (about 6 pages of title info), which he quickly scans for anything that interests him. He then tries 'security council': 1,416 hits, which he does not scan because there are too many. Now suppose he suddenly realizes he should also use the keyword index. Biological warfare: 213 hits. Security council: 2,741 hits. Combined (we just assume that he knows how to do this): 1 hit (a freely available PDF file containing references to other freely available texts and links to websites).

It is my opinion that libraries should bring to the forefront sets of keywords, all related to one general subject. Library users then just have to check these overviews to understand which keywords they should use in their research. To build up such a set it is important to collect just the keywords which were exposed to or used by our patrons, preferably over a longer time span. This way we can be sure that all relevant keywords are collected.

In January of this year I started to collect all the records which were, in one way or another, seen or used by our patrons and visitors. This includes the records 'seen' by search robots, like the ones from Google. The database (we use MongoDB for this) now contains almost 7,500,000 records, but keep in mind that less than 10% of this number can actually be attributed to human beings. Each record contains a publication id, IP number, time stamp and the keywords belonging to the publication. Given the size of the database, it is possible to collect the keywords actually used or exposed in relation to just one general subject.

I managed to create such a set of keywords as an example. They all were used in some combination with my 'main' keyword Biological and chemical weapons. The set contains a little bit more than 410 different keywords, some used quite a lot, others just a few times.
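
Collecting the keywords that occur together with one 'main' keyword can be sketched as follows. This is a minimal illustration in Python, not the script actually used for the blog; the field names and the three sample records are assumptions, modelled only on the record description above.

```python
from collections import Counter

# Hypothetical log records. The field names are assumptions based on the
# description above (publication id, IP number, time stamp, keywords).
records = [
    {"publication_id": "a1", "ip": "192.0.2.1", "timestamp": "2014-03-01T10:00:00",
     "keywords": ["Biological and chemical weapons", "Terrorism", "Genetics"]},
    {"publication_id": "b2", "ip": "192.0.2.7", "timestamp": "2014-03-02T11:30:00",
     "keywords": ["Biological and chemical weapons", "Terrorism"]},
    {"publication_id": "c3", "ip": "192.0.2.9", "timestamp": "2014-03-02T12:00:00",
     "keywords": ["Refugees", "Asylum"]},
]


def related_keywords(records, main_keyword):
    """Count every keyword that appears in a record together with main_keyword."""
    counts = Counter()
    for record in records:
        keywords = set(record["keywords"])  # ignore repeats within one record
        if main_keyword in keywords:
            counts.update(keywords - {main_keyword})
    return counts


related = related_keywords(records, "Biological and chemical weapons")
print(related.most_common())
```

Records that do not carry the main keyword (the third one here) contribute nothing, so the result contains only keywords that were really exposed in combination with it.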

What remains is to determine what kind of presentation gives a quick and thorough impression of specific keywords. I decided for now to use Tableau and to draw three diagrams in different colors. Each diagram shows the keyword descriptions and their relative amount of use; each subsequent diagram presents keywords that were used less and less. So if you are looking for a publication about chemical warfare and genetic manipulation, after a peek at the diagrams below you will know which keywords to use to get to this information. (Curious? Look at this chapter: Terrorism in the Genomic Age / John Ellis, 2004.)
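
How the keywords were divided over the three diagrams is not stated here. One plausible approach, shown purely as an assumption, is to rank the keywords by use and cut the ranking into three groups of roughly equal size. The counts below are made up for illustration.

```python
def split_into_tiers(keyword_counts, tiers=3):
    """Rank keywords by use (descending) and cut the ranking into groups.

    Returns at most `tiers` groups; with very few keywords there may be fewer.
    """
    ranked = sorted(keyword_counts.items(), key=lambda item: (-item[1], item[0]))
    size = max(1, -(-len(ranked) // tiers))  # ceiling division, at least 1
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]


# Made-up usage counts, not figures from the library's database.
example_counts = {"Terrorism": 40, "Genetics": 3, "Arms control": 25,
                  "Disarmament": 12, "Bioethics": 1, "Security Council": 7}

most_used, less_used, least_used = split_into_tiers(example_counts)
print(most_used)
print(less_used)
print(least_used)
```

Each group could then feed one diagram, so that rarely used keywords are not drowned out by the heavily used ones.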

And let's not forget serendipity.

Thursday, 27 November 2014

Keywords. Collecting data.

In the last couple of weeks I blogged about keywords as they were displayed to the users of the OPAC in the library of the Peace Palace. I showed a couple of maps built with Gephi, some exhaustive, others very detailed.

But how did I collect and adapt the data to be used by Gephi? I already mentioned "exposed keywords to the user" in an earlier blog post. So, to start with: what is the meaning of "exposed keywords"? By this I mean "keywords as they occur in the presentation of the titles which were actually seen, perhaps even read, by the user". I am interested in these keywords.

The enumeration of keywords in just one title can be considered a very small network: all these keywords are somehow linked to one another. Therefore, the first step is to gather all the presented titles, the second step is to collect all these small networks of keywords, and the last step is to create one huge file which can be used by Gephi.

In the table below I give some examples of the file structure. In the left column you see five keywords (for Gephi they are nodes), each with a count of one, called 'use'. Underneath that you see the unique combinations of the keywords (for Gephi they are edges), also with a count, called 'weight'. The number after the capital P is a unique keyword identifier. Gephi prefers to do its arithmetic with simple codes rather than with (sometimes) long strings containing odd characters. In the middle column you see the same, except that the keywords now come from another title. In the third column both sets of keywords are combined. Note the keyword 'Women': it occurs in both titles, so in the third column its 'use' is raised to two. At the bottom of each column you see the corresponding Gephi map.

nodedef>name VARCHAR,label VARCHAR, use INT
P076244229,"Middle East",1
P076242366,"Women",1
P076239810,"Islam",1
P076243265,"Family law",1
P07624234X,"Islamic law",1
edgedef>node1 VARCHAR,node2 VARCHAR,weight INT
P076244229,P076242366,1
P076244229,P076239810,1
P076244229,P076243265,1
P076244229,P07624234X,1
P076242366,P076239810,1
P076242366,P076243265,1
P076242366,P07624234X,1
P076239810,P076243265,1
P076239810,P07624234X,1
P076243265,P07624234X,1

nodedef>name VARCHAR,label VARCHAR, use INT
P076239519,"Refugees",1
P076242986,"Asylum",1
P076242366,"Women",1
P356258076,"Girls",1
P076256758,"Immigration",1
P241824206,"Convention relating to the Status of Refugees (Geneva, 28 July 1951)",1
P252290518,"Sex crimes",1
P255990391,"Gender",1
P332051005,"E-docs",1
edgedef>node1 VARCHAR,node2 VARCHAR,weight INT
P076239519,P076242986,1
P076239519,P076242366,1
P076239519,P356258076,1
P076239519,P076256758,1
P076239519,P241824206,1
P076239519,P252290518,1
P076239519,P255990391,1
P076239519,P332051005,1
P076242986,P076242366,1
P076242986,P356258076,1
P076242986,P076256758,1
P076242986,P241824206,1
P076242986,P252290518,1
P076242986,P255990391,1
P076242986,P332051005,1
P076242366,P356258076,1
P076242366,P076256758,1
P076242366,P241824206,1
P076242366,P252290518,1
P076242366,P255990391,1
P076242366,P332051005,1
P356258076,P076256758,1
P356258076,P241824206,1
P356258076,P252290518,1
P356258076,P255990391,1
P356258076,P332051005,1
P076256758,P241824206,1
P076256758,P252290518,1
P076256758,P255990391,1
P076256758,P332051005,1
P241824206,P252290518,1
P241824206,P255990391,1
P241824206,P332051005,1
P252290518,P255990391,1
P252290518,P332051005,1
P255990391,P332051005,1

nodedef>name VARCHAR,label VARCHAR, use INT
P076244229,"Middle East",1
P076242366,"Women",2
P076239810,"Islam",1
P076243265,"Family law",1
P07624234X,"Islamic law",1
P076239519,"Refugees",1
P076242986,"Asylum",1
P356258076,"Girls",1
P076256758,"Immigration",1
P241824206,"Convention relating to the Status of Refugees (Geneva, 28 July 1951)",1
P252290518,"Sex crimes",1
P255990391,"Gender",1
P332051005,"E-docs",1
edgedef>node1 VARCHAR,node2 VARCHAR,weight INT
P076244229,P076242366,1
P076244229,P076239810,1
P076244229,P076243265,1
P076244229,P07624234X,1
P076242366,P076239810,1
P076242366,P076243265,1
P076242366,P07624234X,1
P076239810,P076243265,1
P076239810,P07624234X,1
P076243265,P07624234X,1
P076239519,P076242986,1
P076239519,P076242366,1
P076239519,P356258076,1
P076239519,P076256758,1
P076239519,P241824206,1
P076239519,P252290518,1
P076239519,P255990391,1
P076239519,P332051005,1
P076242986,P076242366,1
P076242986,P356258076,1
P076242986,P076256758,1
P076242986,P241824206,1
P076242986,P252290518,1
P076242986,P255990391,1
P076242986,P332051005,1
P076242366,P356258076,1
P076242366,P076256758,1
P076242366,P241824206,1
P076242366,P252290518,1
P076242366,P255990391,1
P076242366,P332051005,1
P356258076,P076256758,1
P356258076,P241824206,1
P356258076,P252290518,1
P356258076,P255990391,1
P356258076,P332051005,1
P076256758,P241824206,1
P076256758,P252290518,1
P076256758,P255990391,1
P076256758,P332051005,1
P241824206,P252290518,1
P241824206,P255990391,1
P241824206,P332051005,1
P252290518,P255990391,1
P252290518,P332051005,1
P255990391,P332051005,1
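
The way two titles are merged into one set of nodes and edges can be sketched in a few lines. This is a minimal Python illustration (the original scripts were written in PHP and are not shown here); the keyword identifiers are the ones from the two example titles above, and the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations

# Keyword identifiers per title, taken from the two example titles above.
titles = [
    ["P076244229", "P076242366", "P076239810", "P076243265", "P07624234X"],
    ["P076239519", "P076242986", "P076242366", "P356258076", "P076256758",
     "P241824206", "P252290518", "P255990391", "P332051005"],
]


def build_network(titles):
    """Return (nodes, edges): 'use' per keyword and 'weight' per keyword pair."""
    nodes = Counter()
    edges = Counter()
    for keywords in titles:
        unique = list(dict.fromkeys(keywords))  # drop repeats, keep order
        nodes.update(unique)
        for a, b in combinations(unique, 2):
            # Sort each pair so that A-B and B-A count as the same edge.
            edges[tuple(sorted((a, b)))] += 1
    return nodes, edges


nodes, edges = build_network(titles)
print(len(nodes), "nodes,", len(edges), "edges")
print("use of 'Women':", nodes["P076242366"])
```

Every title contributes one small, fully connected network; a keyword or a pair that turns up in more than one title simply gets a higher count. Sorting the identifiers within a pair is a design choice that treats the network as undirected, so the order within a pair may differ from the listings above.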

Of course I do not create these lengthy files (15,000 lines and more) by hand. I wrote a couple of crude PHP scripts to generate a crude file. I clean this file up with R and Microsoft Excel, and the resulting file is ready to be used by Gephi. The scripts use a MongoDB collection which contains all the logging of OPAC use in our reading room. From this logging it is possible to detect 'exposed titles' (and thus also the keywords they contain).
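
Writing out a file in the layout shown above is mostly a matter of string formatting. The sketch below is a Python stand-in for that step, not the PHP actually used; the function name is hypothetical, and replacing double quotes inside a label with single quotes is an assumption made here to keep the quoted label field intact.

```python
def to_gdf(labels, use, edges):
    """Build GDF text from keyword labels, 'use' counts and edge weights."""
    lines = ["nodedef>name VARCHAR,label VARCHAR,use INT"]
    for node_id, label in labels.items():
        safe_label = label.replace('"', "'")  # assumption: avoid nested quotes
        lines.append(f'{node_id},"{safe_label}",{use[node_id]}')
    lines.append("edgedef>node1 VARCHAR,node2 VARCHAR,weight INT")
    for (node1, node2), weight in edges.items():
        lines.append(f"{node1},{node2},{weight}")
    return "\n".join(lines) + "\n"


# A tiny example with two of the keywords from the listings above.
labels = {"P076242366": "Women", "P076239810": "Islam"}
use = {"P076242366": 2, "P076239810": 1}
edges = {("P076242366", "P076239810"): 1}

gdf_text = to_gdf(labels, use, edges)
print(gdf_text)
```

The returned text can be saved with a .gdf extension and opened in Gephi.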

To conclude: this is all very technical stuff, and we cannot expect our users to do this kind of research themselves based on raw data provided by the library. However, some library staff members should certainly be able to do this, and then communicate about the results, using interesting maps for instance: communicate to management about library collection issues, to users about trends, about almost lost niches in the collection, and about current, important subcollections which can be used in updating dossiers, research guides, alerting systems, etc.


Wednesday, 22 October 2014

MOOC: learning and instruction: Tableau and library use.

On October 20, 2014 a new MOOC, "Data, Analytics and Learning" (hashtag #dalmooc), started. I am taking part in this course. In addition to a new approach to the process of learning itself, there is also the usual presentation of course material: video, text, references to relevant literature, and assignments.

One of the first tools we were asked to study is Tableau, a tool to examine, analyze and present data in a visually appealing and very informative manner. Luckily, participants in the course get a code which can be used to install the full desktop version of the software, at least for the duration of the course (until January 2015). To be clear, Tableau is not free software.

After Googling around for instructions, manuals and the like, I realized that none of the example files shown or used in Tableau instruction videos were related to libraries or, more specifically, to OPACs used in libraries. As I work in the library of the Peace Palace, I decided to collect some library data and use it in Tableau, just as an exercise.

Every time our link resolver is used, some data is stored in a database. We use a MongoDB database for this purpose. At the time of writing we have collected a little more than half a million of these documents. Among other things, we store a time stamp, the country of the user (based on IP), general subject information and short bibliographic information. Although I know that it is possible to connect Tableau to a MongoDB server using a special ODBC connector, I will use an Excel file in Tableau, to keep things simple, and generate some equally simple graphics.

The file contains just the country, the general subject in coded form (i.e. a number) and the day number, and is limited to link resolver use in 2014. Even so, we still have some 370,000 rows!

We see that the most popular subject, not surprisingly, is 'European Union (42)', the second most popular subject is 'Human Rights (60)', and a not so popular one is 'Mutual Cooperation in Criminal Matters (147)'.
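
The ranking that Tableau draws here boils down to counting rows per subject code. A minimal Python sketch, assuming rows of (country, subject code, day number) as described above; the six rows are invented for illustration, only the subject codes and names come from the text.

```python
from collections import Counter

# Invented sample rows: (country, subject code, day number).
rows = [
    ("NL", 42, 12), ("DE", 42, 12), ("US", 60, 13),
    ("NL", 60, 14), ("FR", 42, 15), ("GB", 147, 15),
]

# Subject codes and names as mentioned in the text.
subject_names = {
    42: "European Union",
    60: "Human Rights",
    147: "Mutual Cooperation in Criminal Matters",
}

clicks_per_subject = Counter(subject for _country, subject, _day in rows)
ranking = [(subject_names[code], clicks)
           for code, clicks in clicks_per_subject.most_common()]
print(ranking)
```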

Let us focus on 'Human Rights' and see which countries are the most interested in this subject, based on the number of clicks per country.

For contrast, here is the subject 'History'. Please remember that the sizes of the dots are relative to the subject; they do not represent actual numbers:
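
One way to read "relative to the subject" is that the clicks per country are scaled against the busiest country within that one subject, so the largest dot on each map has the same size whatever the absolute numbers are. The sketch below illustrates that reading in Python; it is an assumption about the scaling, not a description of Tableau's internals, and the code 11 for 'History' as well as the sample rows are hypothetical.

```python
from collections import Counter


def relative_sizes(rows, subject):
    """Clicks per country for one subject, scaled so the largest value is 1.0."""
    counts = Counter(country for country, code in rows if code == subject)
    if not counts:
        return {}
    largest = max(counts.values())
    return {country: clicks / largest for country, clicks in counts.items()}


# Invented rows: (country, subject code). 60 is 'Human Rights' in the text;
# 11 is a made-up code standing in for 'History'.
rows = [("NL", 60), ("NL", 60), ("US", 60), ("NL", 11), ("DE", 11)]

human_rights = relative_sizes(rows, 60)
history = relative_sizes(rows, 11)
print(human_rights)
print(history)
```

In this example the Netherlands gets the largest dot on both maps, although it has two clicks for one subject and only one for the other. That is why dot sizes should not be compared between the two maps.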

This is all very interesting, because it gives us an indication of what our users look for and where they come from. It is also interesting to see whether the library staff takes the interests of the patrons into account when acquiring documents for the collection. That is a subject for another blog post.

To conclude: with Tableau it is easy to understand what is important or interesting to students and scholars using our library OPAC and/or link resolver. And I have only scratched the surface of this software...