Almost a year ago, a friend of mine and I ran an experiment: a visualization of the references between papers published in the ACM (Association for Computing Machinery) article library. Each paper is cited by other papers, and so on, so we represented those relationships as a directed graph: each node is a document, and each edge is a reference from one paper to another.
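In code, such a graph can be stored as a simple adjacency list mapping each paper to the papers it references. This is only a minimal sketch with made-up paper IDs, not our actual schema:

```python
# Citation graph as an adjacency list: paper -> papers it references.
# The IDs below are invented for illustration.
citations = {
    "paper_a": ["paper_b", "paper_c"],
    "paper_b": ["paper_c"],
    "paper_c": [],
}

# Reverse index: for each paper, which papers cite it?
cited_by = {}
for src, refs in citations.items():
    for dst in refs:
        cited_by.setdefault(dst, []).append(src)

print(cited_by["paper_c"])  # -> ['paper_a', 'paper_b']
```

The reverse index is handy because the library gives you outgoing references, while "who cites this paper?" is the question you usually want to answer.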
To make it happen, my friend Javier crawled the ACM library and got banned lots of times. He had to restart his modem so often that we ended up "borrowing" some computers at our university to run the crawler and gather the data. After two days we had the disappointing amount of 10,000 articles. We had expected many more, but ACM's anti-crawling measures limited us to fetching only 2 articles per minute. Still, they were enough to play with, so I wrote a tiny REST server to store the information and serve it to a web interface… the result was pretty cool!
See it in action here. You can type and search for an article, or an author. Try “Bayesian” or “Policarpo”.
Then things got more interesting: we discovered a couple of "inconsistencies" within the library. At first we thought it was our fault, maybe some errors in our DB. But it was not. The DB was good and the crawler was good; here is what we found:
Articles that refer to themselves
Because f**k you, that's why. As you can see at the bottom of this post, ACM's explanation involves errors in the Optical Character Recognition (OCR) software they use to extract the references from each article. However, I wonder how this could happen through a mere OCR mistake. A few examples:
Articles that quote each other
How about that! Such a thing should not be possible: two papers cannot each be based on the other, so loops are not legal in this graph. Still, I was quite surprised by the abundance of cases like these; we detected more than 100 occurrences in our small 14k dataset. Examples here:
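Both anomalies are easy to detect once the graph is in memory: a self-reference is a self-loop, and two papers quoting each other form a 2-cycle. Here is a hedged sketch over made-up data (the real detection ran against our DB, not hardcoded dicts):

```python
# Invented citation graph containing both kinds of anomaly.
citations = {
    "self_citer": ["self_citer", "other"],  # article that refers to itself
    "alpha": ["beta"],                      # alpha and beta quote each other
    "beta": ["alpha"],
    "other": [],
}

# Articles that refer to themselves (self-loops).
self_refs = [p for p, refs in citations.items() if p in refs]

# Pairs of articles that quote each other (2-cycles); a < b avoids
# reporting each pair twice.
mutual = [
    (a, b)
    for a, refs in citations.items()
    for b in refs
    if a < b and a in citations.get(b, [])
]

print(self_refs)  # -> ['self_citer']
print(mutual)     # -> [('alpha', 'beta')]
```

Longer cycles would need a proper traversal (e.g. depth-first search), but these two checks already cover everything described in this post.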
This is the ACM Digital Library note, mentioned above:
OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
And that is all. We built this experiment in 3 days as part of a small but very cool competition organized by a small yet innovative company called Edis.