Parsing Library

Jul 1, 2020

I have been working on the Parsing Library, which takes a PDF as input and separates all the questions in it by topic. Topical question classification is a tool in its own right, and it will be integrated with the past paper maker once I finish building the other tools.

Anyway, I have managed to crop out individual questions from a PDF file. I could not find a library that does this, so I had to write the code that identifies the questions in a given PDF using an OCR tool. Since I will be classifying the images locally instead of running them on a server, I don’t have to worry about it not being compatible if moved to other services.

Now I am stuck on an issue I ran into during the day. It concerns attaching the pages of a multi-page question into one single page, so that every question keeps all of its sub-questions together. The hard part is identifying the part questions that continue on the next page.

The way I identify questions in the first place is by locating the question number and using OCR (Optical Character Recognition) to check whether it is a legitimate number or just some random alphabetical characters. As a side effect, if the program can’t find a question index on a page, it doesn’t treat that page as a question page and declares it empty. That behaviour can be used to find the sub-question pages that are separated from the original question page.
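The number check described above can be sketched in a few lines. This is only an illustration, not the code from the library: the function name `parse_question_index` and the accepted formats (one or two digits, optionally followed by a dot or a closing bracket) are assumptions, and the OCR step itself is left out, so the function just takes whatever text the OCR tool returned.

```python
import re
from typing import Optional

# Assumption: a question index is 1-2 digits, optionally followed by "." or ")",
# for example "7", "12." or "3)". Real papers may need a different pattern.
_QUESTION_INDEX = re.compile(r"^(\d{1,2})[.)]?$")


def parse_question_index(ocr_text: str) -> Optional[int]:
    """Return the question number if the OCR text looks like one, else None."""
    match = _QUESTION_INDEX.match(ocr_text.strip())
    if match is None:
        # Letters such as "(a)" or "ii" are sub-question labels or noise.
        return None
    number = int(match.group(1))
    # Question numbering starts at 1, so "0" is treated as an OCR misread.
    return number if number >= 1 else None
```

A page where no cropped region yields a number from this check would then be the "no question index" case mentioned above.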

I will have to check whether a page is really empty so I don’t waste time attaching an empty page. This can be done by reading what’s on the page. The script loads every page anyway, which lets the program read its content, so that is a possible way.
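One possible way to do the emptiness check, assuming the page text has already been extracted (by OCR or by the PDF library), is to count the visible characters. The name `is_blank_page` and the threshold of 20 characters are my own placeholders and would need tuning against real papers.

```python
def is_blank_page(page_text: str, min_chars: int = 20) -> bool:
    """Treat a page as blank if it has fewer than min_chars visible characters.

    Assumption: stray marks such as a page number or a footer code add only a
    handful of characters, so a small threshold separates them from questions.
    """
    # Drop all whitespace so spaces and line breaks do not count as content.
    visible = "".join(page_text.split())
    return len(visible) < min_chars
```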

One good thing about A level past papers is that if a question is long and would spill a little onto a new page, they print that question on the new page directly and leave the rest of the previous page empty. I used to think of this as space for us to write rough work, which is correct, but it also makes the pages look cleaner. This way I don’t have to worry about a small portion of a question being left on another page. If a question does continue, the continuation takes up an entire page and I can identify that; otherwise my program will detect a new page with a new question.
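Given that layout, attaching pages to questions comes down to a single pass over the pages. The sketch below is a rough outline under assumptions: each page is summarised as a hypothetical `(question_index, is_blank)` pair, where `question_index` is `None` when no question number was found, and `group_pages` is an illustrative name.

```python
from typing import List, Optional, Tuple

# Hypothetical per-page summary: (question index or None, page is blank?)
PageInfo = Tuple[Optional[int], bool]


def group_pages(pages: List[PageInfo]) -> List[Tuple[int, List[int]]]:
    """Group page numbers by question; continuation pages join the previous one."""
    groups: List[Tuple[int, List[int]]] = []
    for page_no, (index, blank) in enumerate(pages):
        if blank:
            continue  # nothing to attach
        if index is not None:
            groups.append((index, [page_no]))  # a new question starts here
        elif groups:
            groups[-1][1].append(page_no)  # sub-questions of the last question
        # A non-blank page before any question (e.g. a cover page) is skipped.
    return groups
```

For example, `group_pages([(1, False), (None, False), (None, True), (2, False)])` returns `[(1, [0, 1]), (2, [3])]`: page 1 is attached to question 1 and the blank page 2 is dropped.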

Easy. Let’s build this. I will be back after I finish it and, presumably, fix some bugs.

There’s a bug!

There are technically 2 bugs. One is that the program stops processing a PDF file before reaching the end of the file, and I am not sure how I am going to fix this one. The other is a runtime error caused by the PDF processing library not finding any PDF file in memory. Weird: the first thing the program does is load the PDF file, so what’s the issue?

Fixed the bug, although the way I did it is a little unconventional and inefficient. The bug was simply an error in processing the pages of a PDF file, so I now load the PDF again every time a new page is to be processed. The program sees each load as a new PDF file and starts a new process. I mean, it works. Maybe in the future I will change the entire library to fix this issue for the sake of efficient page processing.
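The shape of that workaround looks roughly like the sketch below. Since the PDF library is not named here, the loader is passed in as a hypothetical `load_pdf` callable that returns an indexable sequence of pages; `iter_pages_with_reload` is likewise just an illustrative name.

```python
from typing import Callable, Iterator, Sequence, TypeVar

Page = TypeVar("Page")


def iter_pages_with_reload(
    load_pdf: Callable[[str], Sequence[Page]], path: str
) -> Iterator[Page]:
    """Yield each page from a freshly loaded copy of the PDF.

    Assumption: load_pdf(path) returns a sequence of pages. The file is loaded
    once to count the pages, then once more per page, so a document with n
    pages is loaded n + 1 times.
    """
    page_count = len(load_pdf(path))
    for page_no in range(page_count):
        document = load_pdf(path)  # fresh load, so no state carries over
        yield document[page_no]
```

The cost is one full load per page, which is why it is inefficient: a 20-page paper is loaded 21 times.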

Originally published at http://manishgotame.wordpress.com on July 1, 2020.


Written by Open Past Paper