A lot of notebooks... and bugs (Bachelor's Thesis Part 2)

(Cover image: a photo from Unsplash by Pietro Jeng, put here to illustrate a neural network. :-))

Hey, Kristjan here!

Remember the post I made back in October -- if you don't, that's okay. I am not that famous anyway. But if you do, you might remember that I started my Bachelor's thesis about named-entity recognition (or NER for short) and oh boy do I have a lot to tell you about how the thesis is going. But first, a little reminder of the post in October.

Flashback

As you may recall, the topic of my thesis is named-entity recognition in 19th-century borough court protocols. Estonia's written record goes back quite some time, and there are literary scholars who wish to learn what these courts were doing in the 19th century. This is all very interesting because you can trace not only the development of the Estonian language but also the lives of townsfolk and boroughs. For this, a database was created -- because the court protocols still exist (on paper), they could quite easily be digitised. That is happening right now: almost a hundred people are transcribing exactly what they see on paper into a database of protocols. And it is my duty to find the named entities in those protocols... or so I thought.

The gold standard 🏆

I was given a .zip file of the gold standard tagged files. This means that there are people who have gone over every single borough court protocol and written down every single named entity by hand. This is the only way to achieve 100% correctness -- computers haven't gotten that far yet.

As you may remember from my last post, the highest F1 score a computer has gotten on NER is 0.9339. Loosely speaking, that means for every 1000 named entities, about 934 of them would be recognised correctly.

Consider the gold standard a score of 1. The perfect score. What I had to do was use the Python library EstNLTK to tag the borough court texts with the NER tagger, so that we would have files carrying the gold standard tags. I essentially had the character indices of tags in a file and had to show Python where the tags should go. This, however, was not an easy task -- since the gold standard was annotated with a different tool than the one usually used, the indices were off. Every newline character, which developers may recognise as \n, counted not as one but as two character positions. I had to count how many newlines preceded each annotation and subtract that number from the indices given in the file. Sounds easy enough, but there were a lot of edge cases. Got through them though -- winning!
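The newline arithmetic above can be sketched roughly like this. This is just a minimal illustration of the idea -- `to_real_index` is a made-up helper name, and my actual code had to deal with far more edge cases:

```python
def to_real_index(raw_index: int, text: str) -> int:
    """Map an index from the annotation tool's coordinate system,
    where every newline counts as two characters, back to a normal
    Python string index."""
    inflated = 0
    for real, ch in enumerate(text):
        if inflated == raw_index:
            return real
        inflated += 2 if ch == "\n" else 1
    return len(text)

text = "First line\nSecond line"
# The "S" of "Second" sits at Python index 11, but the annotation
# tool counted the newline twice, so its file index was 12:
print(to_real_index(12, text))  # → 11
```

Walking the string and tracking an "inflated" position avoids having to re-count preceding newlines for every annotation separately.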

Machine learning 🖥️

As of right now, I am struggling to create new models based on the gold standard annotations. I mean, I am not exactly struggling, I just don't quite understand how everything works yet. But I know that all of this has to do with machine learning. You essentially feed the computer the data you have gathered and labelled, and the computer learns patterns from that data and builds a model of it.

A brief example:

Kersti Kaljulaid is the president of Estonia.

There are two named entities in this sentence -- Kersti Kaljulaid, who is a person (tagged PER) and Estonia, which is a location (tagged LOC). So the correct tagging would be:

[Kersti Kaljulaid]PER is the president of [Estonia]LOC.

To make matters more complicated, Kersti Kaljulaid is a two-word named entity, which means that fully deconstructed, the sentence looks like this:

[[Kersti]B-PER [Kaljulaid]I-PER]PER is the president of [Estonia]LOC.

B-PER and I-PER stand for beginning-person and inside-person respectively.
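In code, the same sentence might look like a list of (token, label) pairs. This is just a toy representation -- real NER corpora use formats like CoNLL, and EstNLTK has its own annotation layers -- but it shows how the B-/I- prefixes let you glue multi-word entities back together:

```python
# A toy BIO-tagged sentence as (token, label) pairs.
sentence = [
    ("Kersti",    "B-PER"),  # beginning of a person name
    ("Kaljulaid", "I-PER"),  # inside the same person name
    ("is",        "O"),      # O = outside any entity
    ("the",       "O"),
    ("president", "O"),
    ("of",        "O"),
    ("Estonia",   "B-LOC"),  # single-token location
    (".",         "O"),
]

# Reassemble full entities from the BIO labels:
entities = []
for token, label in sentence:
    if label.startswith("B-"):
        entities.append([label[2:], [token]])   # start a new entity
    elif label.startswith("I-") and entities:
        entities[-1][1].append(token)           # continue the last one

print([(tag, " ".join(toks)) for tag, toks in entities])
# → [('PER', 'Kersti Kaljulaid'), ('LOC', 'Estonia')]
```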

If you give the computer this sentence, at first it will have no idea what to do with it -- to the machine it just looks like a bunch of zeroes and ones. But if you also give it the data about where the PER and LOC tags go, as I showed above, it can learn to make sense of such sentences and tag them itself.
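To make "the computer learns the data" a bit more concrete, here is a deliberately naive toy learner that just memorises the most frequent label for each token it has seen. Real NER models (CRFs, neural networks, and what EstNLTK actually uses) also look at context, capitalisation and so on -- this is only meant to show the shape of supervised learning, and all the names here are mine:

```python
from collections import Counter, defaultdict

def train(labelled_sentences):
    """Count how often each token carries each label, then keep
    the most frequent label per token as the 'model'."""
    counts = defaultdict(Counter)
    for sentence in labelled_sentences:
        for token, label in sentence:
            counts[token][label] += 1
    return {tok: c.most_common(1)[0][0] for tok, c in counts.items()}

def predict(model, tokens):
    # Unknown tokens default to "O" (outside any entity).
    return [(tok, model.get(tok, "O")) for tok in tokens]

train_data = [[("Kersti", "B-PER"), ("Kaljulaid", "I-PER"),
               ("visited", "O"), ("Estonia", "B-LOC")]]
model = train(train_data)
print(predict(model, ["Kaljulaid", "praised", "Estonia"]))
# → [('Kaljulaid', 'I-PER'), ('praised', 'O'), ('Estonia', 'B-LOC')]
```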

Gotta use my brain for this 🧠

And so I sit here, evening by evening (because I have a 9-5 job now) and write code. Test the code (which takes an awful lot of time, approximately 2 hours to run a piece of code) and refactor it to make it faster, more efficient and better at learning. I have to say, this is a very fun task, but it sometimes gets very overwhelming, because some parts of the code have just been copied from the aforementioned Python library. I do have great communication with my instructor -- he helps me debug the code and since he is one of the authors of EstNLTK, he knows the code almost by heart.

It is almost the end of January, which means that I still have about 5 more months to write this code, test it, run experiments with it and eventually write it all up into the thesis itself, which should hopefully be enough to get an A. I really, really like this topic and hope that it will bring me a shower of "good job"-s. But first, I have to step up my game and write the code. Ugh.

So did I make your head dizzy? Or does my explanation sound understandable? Let me know down in the comments if you're familiar with machine learning and data science overall. What do you think of it?

Peace.
