I have been working on the marking scheme extractor (for lack of a better name). The problem is not that hard because I have already solved PDF localization and extraction (although a few bugs remain), so I reused that knowledge and code, with slight modifications, for the marking scheme extractor. The extractor has to work with a file generated by the splitter, so that it pulls out exactly the answers to the questions the splitter produced.
I am having some trouble sharing data from the splitter to the extractor, though. I can’t figure out how to shape the data so that the extractor can easily use it to extract all the answers. I wrote the first version of the extractor today; it is pretty bad, but it works on every past paper up to 2016. After 2016 the format changes completely, but it is much easier for us to extract: they’ve written the paper in tabular form, which is far easier than trying to calculate the position of each question number on the page.
I will work on the extractor again tomorrow and also try to figure out how to share the metadata between these two programs more smoothly.
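One shape I could try for the shared data is a small JSON manifest that the splitter writes and the extractor reads. This is only a rough sketch: the names `QuestionRecord`, `dump_manifest`, `load_manifest` and all the fields are made up for illustration, not what the programs use right now.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class QuestionRecord:
    # Hypothetical fields; the real splitter output may differ.
    number: str   # question label, e.g. "1" or "2(a)"
    page: int     # page index in the question paper
    top: float    # distance from the top of the page to the label


def dump_manifest(records: List[QuestionRecord]) -> str:
    """Serialize the splitter's records to a JSON string."""
    return json.dumps({"questions": [asdict(r) for r in records]}, indent=2)


def load_manifest(text: str) -> List[QuestionRecord]:
    """Rebuild the records on the extractor side."""
    return [QuestionRecord(**q) for q in json.loads(text)["questions"]]
```

With something like this, the extractor would only need the list of question labels to know which answers to look for, and both programs would agree on one file format.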
There was a bug in Parser.
Well, there are a lot of bugs in Parser, but this one is important to fix. When a question number like 1, 2, 3 is found, the program assumes it has a line of text to itself, which is true in most papers. But a few papers put the question number and the sub-question label, like (a), (b), (c), on the same line. In that case the program thinks they are two different question blocks and treats them separately, but the output is two identical images, which we don’t want.
So I tried to fix the bug today. It turned out to be easy: we already pass the metadata to the image extractor, and when the two labels are on the same line, their distance from the top of the page is exactly the same, so checking for that solved it. Let’s move on to the next one.
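The idea behind the fix can be sketched roughly like this. It assumes each detected label carries its page and its distance from the top of the page; the names `Marker`, `merge_same_line` and the `tolerance` parameter are hypothetical and this is not the actual Parser code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Marker:
    label: str   # e.g. "1" or "(a)"
    page: int    # page the label was found on
    top: float   # distance from the top of the page


def merge_same_line(markers: List[Marker], tolerance: float = 1.0) -> List[Marker]:
    """Fold labels that share a line into one block so only one image is cut."""
    merged: List[Marker] = []
    # sorted() is stable, so labels with identical (page, top) keep input order.
    for m in sorted(markers, key=lambda m: (m.page, m.top)):
        last = merged[-1] if merged else None
        if last is not None and last.page == m.page and abs(last.top - m.top) <= tolerance:
            # Same page and same distance from the top: treat as one block.
            last.label += m.label
        else:
            merged.append(Marker(m.label, m.page, m.top))
    return merged
```

A question number "1" and a sub-question "(a)" found at the same height on the same page would come out as a single block labelled "1(a)", so the image extractor gets one region instead of two identical ones.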
Done with the second version of the marking scheme extractor. As if building the Parser over and over again was not already enough, now I have to repeat it for the marking scheme splitter too. I am kind of fed up with the way I have been solving these issues: the fixes are more of a patch than a fully fledged solution to the problem I am facing right now. But as a basic solution, this will suffice.