Accurate data localization and extraction

Aug 26, 2020

Maybe I should rebuild the PDF processing library so that I have full control over the data I extract for this project. The digit classifier is not that good, and the whole idea of running OCR on PDFs is unreliable. I don’t know whether I will do that now or later as an improvement. But it is really frustrating to work with images and use sloppy algorithms to locate the questions when they are not a hundred percent accurate.

If a library can split a PDF within a single page, surely it can read and track the text line by line. I just need to find out how it does this. These libraries are open source, so I can dig into them to find out how they work.

Update 1

Perhaps I will not have to “reinvent the wheel”, because someone else has already done some work on this problem. I found a library while looking for answers on Stack Overflow, and my face lit up with joy when I saw what it could do with PDFs and images.

It reads the PDF line by line, locates the position of the text on the page, forms a box around it, and makes an image of it, which is exactly what I am looking for. But wait: I have not tried it in code yet, and that will decide how well it actually works. This is only what I have read in its GitHub repo, but it looks very promising.
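To make the "line by line, then a box around it" idea concrete, here is a minimal sketch in plain Python. It is not the library's API: the `Char` record, its field names, and the `tolerance` value are all assumptions I made up for illustration, with coordinates measured from the top-left corner of the page.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Char:
    """Hypothetical per-character record (not any real library's type)."""
    text: str
    x0: float      # left edge
    top: float     # top edge, measured down from the top of the page
    x1: float      # right edge
    bottom: float  # bottom edge, measured down from the top of the page


def group_into_lines(chars: List[Char], tolerance: float = 2.0) -> List[List[Char]]:
    """Group characters into lines by comparing each one's top coordinate
    with the first character of the current line."""
    lines: List[List[Char]] = []
    for ch in sorted(chars, key=lambda c: (c.top, c.x0)):
        if lines and abs(ch.top - lines[-1][0].top) <= tolerance:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    # Restore left-to-right reading order inside each line.
    for line in lines:
        line.sort(key=lambda c: c.x0)
    return lines


def bounding_box(line: List[Char]) -> Tuple[float, float, float, float]:
    """Smallest (x0, top, x1, bottom) box that contains every character."""
    return (
        min(c.x0 for c in line),
        min(c.top for c in line),
        max(c.x1 for c in line),
        max(c.bottom for c in line),
    )
```

The box returned for each line is what would be handed to whatever crops the page into an image.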

I am surprised, impressed, and dumbfounded at the same time: impressed because this looks more reliable than any other PDF processing library I have come across, and dumbfounded because how did I not find this earlier?

Whoever built this library is going to help me solve this problem so easily. I can’t be more thankful.

The problem with the second iteration of the Parser / Parsing Library is that it relies heavily on image processing to extract the questions. In particular, detecting and localizing the question numbers is not 100% accurate, so it may work on one PDF but fail on others. But this library literally gives me the font characteristics, height, width, and position of every single character on a page. How cool is that!

Filtering the question numbers by font is pretty simple, and I can then use their position values on the page to extract the images. That should mean 100% accuracy on every single PDF. This is madness. I will work on this tonight and figure out everything. This is exhilarating.
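Roughly, I imagine the font filter and the position-based cropping looking like the sketch below. Everything in it is an assumption on my part: the character records are plain dicts with hypothetical keys (`text`, `fontname`, `x0`, `top`), the question numbers are assumed to be set in their own font near the left margin, and the font name and margin value in any real paper would have to be found by inspection.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def find_question_starts(
    chars: List[dict], number_font: str, left_margin: float
) -> List[Tuple[int, int]]:
    """Return sorted (top, question_number) pairs.

    A character counts as part of a question number only if it is a digit,
    uses `number_font`, and starts at or left of `left_margin`.
    """
    digits_by_top: Dict[int, str] = {}
    for ch in sorted(chars, key=lambda c: (round(c["top"]), c["x0"])):
        if (
            ch["text"].isdigit()
            and ch["fontname"] == number_font
            and ch["x0"] <= left_margin
        ):
            key = round(ch["top"])  # digits on the same line share a key
            digits_by_top[key] = digits_by_top.get(key, "") + ch["text"]
    return sorted((top, int(digits)) for top, digits in digits_by_top.items())


def question_boxes(
    starts: List[Tuple[int, int]], page_width: float, page_height: float
) -> Dict[int, Box]:
    """Each question runs from its own number down to the next number,
    or to the bottom of the page for the last one."""
    boxes: Dict[int, Box] = {}
    for i, (top, number) in enumerate(starts):
        bottom = starts[i + 1][0] if i + 1 < len(starts) else page_height
        boxes[number] = (0.0, float(top), float(page_width), float(bottom))
    return boxes
```

Rounding the top coordinate is a crude way to decide that two digits sit on the same line, and cropping exactly at the top of the digit may clip tall glyphs, so a real version would probably want a tolerance and a little padding.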

Furthermore, this would let us separate questions not just by question number but also by sub-question, which matters because A-level questions are built using knowledge from different topics. This is amazing. What a great day.

I think I will have to dig into this library to pull out whatever is useful and include it in my own project. I will decide on that later.
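Splitting by sub-question could follow the same pattern. The sketch below is only a guess at how it might work: it assumes the text of one question has already been grouped into `(top, text)` lines and that sub-questions start with a label such as "(a)" or "(b)". Both the input shape and the label format are my assumptions, and nested labels like "(i)" would need extra handling.

```python
import re
from typing import List, Tuple

# Assumed label format: a single lowercase letter in parentheses at line start.
SUB_LABEL = re.compile(r"^\(([a-z])\)")


def split_sub_questions(
    lines: List[Tuple[float, str]], question_bottom: float
) -> List[Tuple[str, float, float]]:
    """Return (label, top, bottom) for each sub-question inside one question.

    `lines` holds (top, text) pairs for the question; `question_bottom` is
    where the question ends on the page.
    """
    marks: List[Tuple[float, str]] = []
    for top, text in sorted(lines):
        match = SUB_LABEL.match(text.strip())
        if match:
            marks.append((top, match.group(1)))

    parts: List[Tuple[str, float, float]] = []
    for i, (top, label) in enumerate(marks):
        # A sub-question ends where the next one starts, or where the
        # whole question ends.
        bottom = marks[i + 1][0] if i + 1 < len(marks) else question_bottom
        parts.append((label, top, bottom))
    return parts
```

Each `(label, top, bottom)` triple gives the vertical slice of the page to crop for that sub-question.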

Originally published at http://manishgotame.wordpress.com on August 26, 2020.

Written by Open Past Paper