Category Archives: Computational Linguistics

Oops: there goes the blog in 2012

I knew I was neglecting my blog in 2012, but I did not realize just how much until I received WordPress’ year in review for 2012 (feel free to take a peek at it). The line that stopped me dead was “In 2012, there was 1 new post”. Sure enough – one post it was.

Well, this blog might be comatose, but I am not. In fact, quite the opposite: I have been so busy that there is very little time left for crafting articles.


Conjunctions in named entities

A recent article on the LingPipe blog discussed conjoined named entities such as Johnson and Johnson and Wallace and Gromit. They suggest that one way of treating these is as frozen expressions. I take that to mean relying on statistical measures to see whether the multi-word expression repeats often enough to be treated as a unit.

In the United Nations corpus, things can get even more interesting. Let’s look at a relatively easy example: draft resolution A/56/L.28 and Add.1.

Is this one document (one draft resolution) or two? And if two, then which two? The first one is obviously A/56/L.28. But Add.1 is not a valid document symbol on its own; it is actually an (additive?) coreference to the first symbol and resolves to A/56/L.28/Add.1.
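To make the resolution step concrete, here is a minimal sketch of how such a conjoined mention could be expanded into two full symbols. The regular expression is a deliberate simplification of the real UN document symbol grammar, and the helper name is hypothetical rather than part of any existing tool.

    import re

    # Hypothetical sketch: expand "A/56/L.28 and Add.1" into two full document
    # symbols. The pattern below covers only this simple case, not the full
    # UN document symbol grammar.
    SYMBOL = re.compile(r'(A/\d+/L\.\d+)\s+and\s+(Add\.\d+)')

    def expand_conjoined_symbol(text):
        match = SYMBOL.search(text)
        if not match:
            return []
        base, addendum = match.groups()
        # The addendum only makes sense relative to the base symbol, so the
        # second mention resolves to base + "/" + addendum.
        return [base, base + '/' + addendum]

    print(expand_conjoined_symbol('draft resolution A/56/L.28 and Add.1'))
    # ['A/56/L.28', 'A/56/L.28/Add.1']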

The answer (as good as I can make it so far) could lie in the FRBR distinction between Expression and Manifestation. A resolution is an expression of Member States’ proposals and negotiations, and to some degree it evolves over several meetings. Between the discussions, however, the latest version or changes need to be reported, both to make sure they are formally registered and to ensure that the next round of discussions has the latest documents to work from.

In our case, the first time the draft resolution had to be presented, it was published under A/56/L.28 (which incidentally means limited-distribution document 28 of the General Assembly’s 56th regular session). So the initial Manifestation of the draft resolution became this physical document with a distinct symbol assigned.

But apart from its text, a draft resolution has a list of sponsoring Member States, and that list can change as the draft resolution gains sponsors. These additional sponsors were listed in the addendum A/56/L.28/Add.1. The addendum does not make sense without the original document, however, so both physical documents actually represent one logical draft resolution, which is reflected in the grammar of the text (draft resolution, not resolutions).

What this means for named entity annotations and for recognition algorithms is hard to say, and it is something I am looking at in my PhD research.

Visualizing CiteULike collections

I am collecting my reading and reference material in CiteULike. I like the service because it can capture details from multiple sources. It also lets me discover what other interesting people have collected by navigating the graph of tags, people and bookmarks.

Nice as CiteULike is, it is fairly difficult to get an overall picture of one’s own collection. It is especially difficult to see quickly whether there are people who serve as hubs by collaborating with multiple different groups. The information is there, but it takes a lot of clicks to dig it out.

My usual solution is to export information out, massage it into Graphviz format and use graph segmentation and layout algorithms to get a better overview. I have talked about Graphviz a number of times on this blog before. This is yet another time it proved useful.

I started by exporting the contents of my CiteULike library. I found the EndNote export format to be more structured and therefore easier to parse. I then ran it through a custom Python program that basically spat out a graph with titles pointing at authors. That produced a very large graph and was not particularly useful on its own.
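The program itself is not published, but the conversion step looks roughly like the sketch below. It assumes CiteULike’s EndNote (refer-style) export, where records are separated by blank lines, titles sit on %T lines and authors on %A lines; the file names in the usage line are made up for the example.

    import sys

    # Rough sketch of the EndNote-export-to-Graphviz conversion, not the exact
    # program I used. Records are assumed to be blank-line separated, with
    # titles on %T lines and authors on %A lines.
    def records(lines):
        record = {}
        for line in lines:
            line = line.strip()
            if not line:
                if record:
                    yield record
                    record = {}
                continue
            tag, _, value = line.partition(' ')
            record.setdefault(tag, []).append(value)
        if record:
            yield record

    def main(path):
        print('digraph citeulike {')
        with open(path, encoding='utf-8') as handle:
            for record in records(handle):
                title = record.get('%T', ['(untitled)'])[0]
                for author in record.get('%A', []):
                    # One edge per title-author pair; double quotes are
                    # replaced so the DOT output stays parseable.
                    print('  "%s" -> "%s";' % (title.replace('"', "'"),
                                               author.replace('"', "'")))
        print('}')

    if __name__ == '__main__':
        main(sys.argv[1])

Running something along the lines of python citeulike2dot.py citeulike.txt > output.dot then produces the DOT file that the Graphviz tools below work on.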

The next step was to discover the disjoint clusters of titles/authors, i.e. the connected components. I used ccomps with the -v and -x flags (e.g. ccomps.exe -v -x -o comp.dot output.dot).

ccomps gave me the partitioned graphs as well as statistics on the number of nodes/edges in each one. I could then choose a graph with a large number of nodes/edges (eventually, all of them) and run it through neato with overlap=scale and splines=true (e.g. neato.exe -Tgif -o neato_1.gif -Goverlap=scale -Gsplines=true comp_1.dot).

The resulting graph was still not perfect, but it was a good start. I also tried fdp instead of neato, but that seemed to produce giraffe versions of the graph with graph edges being overly long.

You can see an example of neato output for one of my clusters. Warning: if it causes problems due to its size, try it with IrfanView; that program can display even improbably large graphs (e.g. unpartitioned ones).

I ran into some problems as well that would either cause partitions to merge together or produce duplicate nodes and edges.

The first problem was that sometimes a person appeared as an author and sometimes as an editor. I was interested in both, so I collapsed those fields together. That caused some non-people to show up on the graph and connect clusters in unexpected ways. For my library the specific value was ‘European’, so I filtered it out in the code.
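In terms of the sketch above, the merge and the filter amount to something like this (the stoplist is specific to my library, and the %E editor tag is an assumption about the export format):

    # Collapse authors (%A) and editors (%E) into one list of people,
    # dropping stray values that are not actually people.
    STOPLIST = {'European'}

    def people(record):
        names = record.get('%A', []) + record.get('%E', [])
        return [name for name in names if name not in STOPLIST]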

The second problem had to do with CiteULike’s parsing. Sometimes it would split a first+last name into two separate names, probably due to an incorrect manual entry at some point. I had to fix those at the source by editing the corresponding CiteULike entries. Probably a good thing to do anyway.

The other problem is right out of the co-reference resolution domain. Sometimes names would include full first names, sometimes only a first-name initial. I worked around that by normalizing all first names to initials. Obviously, this could collapse entries belonging to multiple real people into one.
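The normalization itself is simple; a sketch is below, assuming the export writes names as “Last, First Middle” (which is not guaranteed for every entry):

    # Normalize "Last, First Middle" to "Last, F. M.". Both "Smith, John"
    # and "Smith, J." map to "Smith, J.", at the risk of merging different
    # people who share a surname and an initial.
    def normalize(name):
        last, _, first = name.partition(',')
        initials = ' '.join(part[0] + '.' for part in first.split() if part)
        return ('%s, %s' % (last.strip(), initials)) if initials else last.strip()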

Continuing with the name problems: in the case of non-English names (e.g. Spanish names with multiple surnames), CiteULike would get confused about which part is which and would not display or export them correctly. Additionally, characters such as ñ were sometimes entered as a plain n. Those also needed to be corrected manually.

The project only took a couple of hours, including writing the code and the cleanup. It is already useful to me, as I found a new person who appears on an unexpectedly large number of papers and also found a chain of connections that might be interesting to follow more closely.

There is of course a lot more that could be done. Automatic co-reference resolution of misspelt names, layout hints based on the number of times authors appear together, color coding of tags – these are just some of the easy ideas.

There might even be a small project/paper in doing co-reference resolution and cleaning up the CiteULike data. After all, similar projects have been done for Wikipedia. I don’t think CiteULike currently makes a full export available, but they do offer some exports, so they might be amenable to producing a special data set for research purposes.