Parsing Library

Open Past Paper
3 min read · Jul 1, 2020


I have been working on the Parsing Library, which takes a PDF as input and sorts all the questions in it by topic. Topical question classification is really a tool of its own, and it will be integrated with the past paper maker once I finish building the other tools.

Anyway, I have managed to crop individual questions out of a PDF file. I could not find a library that does this, so I had to write the code that identifies the questions in a given PDF myself, using an OCR tool. Since I will be classifying the images locally instead of on a server, I don’t have to worry about compatibility if things move to other services.

Now I am stuck on an issue I noticed earlier in the day. It concerns joining the pages of a multi-page question into one single page, so that every question keeps all of its sub-questions together. The hard part is identifying the sub-questions that continue on the next page.

The way I identify questions in the first place is by locating the question number and using OCR (Optical Character Recognition) to check whether it is a legitimate number or just some random alphabetical characters. As a side effect, the program does not treat a page as a question page, and declares it empty, if it can’t find a question index. That can be used to find the sub-question pages that are separated from the original question page.
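A minimal sketch of that number check, in Python. The post does not say which language or OCR tool is used, so the function name `parse_question_index` and the rule "one or two digits, optionally followed by a dot or closing bracket" are assumptions for illustration only. The input is whatever text the OCR tool returned for the cropped index region.

```python
import re

# Assumed format of a question index: "7", "12." or "3)".
# Anything else (letters, symbols, long digit runs) is rejected.
QUESTION_INDEX = re.compile(r"([0-9]{1,2})[.)]?")


def parse_question_index(ocr_text):
    """Return the question number as an int, or None if the text is not one."""
    match = QUESTION_INDEX.fullmatch(ocr_text.strip())
    if match is None:
        return None
    number = int(match.group(1))
    # Question numbering starts at 1, so "0" is treated as an OCR misread.
    return number if number >= 1 else None
```

A page on which no candidate region yields a number would then count as "no question index found", which is the signal used for continuation pages.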

I will have to check whether such a page is really empty, so I don’t waste time attaching a blank page. This can be done by reading what’s on the page. The script already loads every page, which lets the program read its content, so that is one possible way.
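One way the emptiness check could look, as a rough Python sketch. It assumes the page text has already been extracted (by OCR or by the PDF library) into a string. The name `is_blank_page` and the threshold of 20 characters are hypothetical, chosen so that a stray page number or a short printed notice still counts as blank.

```python
def is_blank_page(page_text, min_chars=20):
    """Treat a page as blank if it has fewer than min_chars visible characters."""
    # Drop all whitespace so spacing and line breaks do not count as content.
    visible = "".join(page_text.split())
    return len(visible) < min_chars
```

The threshold would need tuning against real papers; it is only a starting guess.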

One good thing about A level past papers is that if a question is long and would spill a little onto the next page, they print the new question directly on a new page and leave the rest of the previous page empty. I used to think of this as space for us to do rough work, which is correct, but it also makes the pages look more elegant. This way I don’t have to worry about a small portion of a question being left on another page. If part of a question does carry over, it will have the entire page to itself and I can identify that; otherwise my program will detect a new page with a new question.
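Put together, the page layout described here suggests a simple grouping rule: a page with a question index starts a new question, a non-blank page without one is attached to the previous question, and blank pages are skipped. The Python sketch below illustrates that rule only. The function name and the `(question_index, is_blank)` input format are assumptions, not the library's actual design.

```python
def group_pages_by_question(pages):
    """Group page numbers by question.

    pages: one (question_index_or_None, is_blank) tuple per page, in order.
    Returns a list of (question_index, [page_numbers]) with 0-based pages.
    """
    groups = []
    for page_number, (index, blank) in enumerate(pages):
        if blank:
            continue  # nothing worth attaching
        if index is not None:
            groups.append((index, [page_number]))  # a new question starts here
        elif groups:
            groups[-1][1].append(page_number)  # continuation of the last question
        # A non-blank page before the first question (e.g. a cover) is ignored.
    return groups
```

For example, a cover page, question 1 across two pages, a blank page, and then question 2 would come out as two groups: question 1 with pages 1 and 2, and question 2 with page 4.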

Easy. Let’s build this. I will be back after I finish it and, presumably, fix some bugs.

There’s a bug!

There are technically 2 bugs. One is that the program stops processing a PDF file without reaching the end of the file, and I am not sure how I am going to fix this one. The other is a runtime error caused by the PDF processing library not finding any PDF file in memory. Weird: the first thing the program does is load the PDF file, so what’s the issue?

Fixed the bug, although the way I did it is a little unconventional and inefficient. The bug was simply an error in processing the pages of a PDF file, so what I did was load the PDF again every time a new page is to be processed. The program sees each load as a new PDF file and starts a new process. I mean, it works. Maybe in the future I will rework the entire library to fix this properly, for the sake of efficient page processing.
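The shape of that workaround, sketched in Python under loud assumptions: `open_pdf` stands in for whatever loader the real PDF library provides and is assumed to return a list-like object of pages, and `handle_page` stands in for the per-page work. Neither name comes from the actual project.

```python
def process_pdf_page_by_page(path, open_pdf, handle_page):
    """Reopen the PDF for every page so no state carries over between pages.

    open_pdf(path): hypothetical loader returning a list-like of pages.
    handle_page(page): hypothetical per-page processing function.
    """
    page_count = len(open_pdf(path))  # one extra load just to count the pages
    results = []
    for page_number in range(page_count):
        document = open_pdf(path)  # fresh load for each page
        results.append(handle_page(document[page_number]))
    return results
```

The cost is one full load per page plus one to count them, which is why it is inefficient on long papers, but each page is handled against a freshly loaded document.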

Originally published on July 1, 2020.


