Failing signs of OCR

OCR inaccuracy is making the parser highlight the same question multiple times

Parser is a library I have built for splitting A-level question papers and reattaching portions of a paper according to their question number. It works as expected on the multiple-choice question paper that I initially used to build the whole thing, but it fails on other question papers. That is weird, since every question paper uses the same format, created by Cambridge International Examinations.

I initially thought the whole problem was with the detection of question numbers, but it looks more like a major issue with the reattachment module, which is pretty hacked together in my opinion. It was never really designed to scale, but I didn’t expect it to be this bad.

I am considering rewriting the entire reattachment module and testing it on a variety of question papers. Let’s see how it goes.

Okay, so I was wrong: the problem is in fact with the detection module (which detects the question numbers on a page), not with the reattachment module. I am still looking into why the program is overlapping multiple questions. Turns out the hacked-together thing actually works better than the third-party library itself.

Update 1

Maybe throwing in the Keras MNIST classifier to read the question numbers might do the trick here; it would probably do better than the OCR library I am using right now. But first, I need to confirm that the main problem really is with the OCR and nothing else. Otherwise, I will waste time working on an AI that produces exactly the same output as the current setup.
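One rough way to check whether the OCR is the culprit is to compare what it reads against what a question paper should contain. The sketch below assumes question numbers run 1, 2, 3, ... without gaps and that the OCR readings are available as a list of strings in reading order. The function name `audit_question_numbers` and the sample readings are made up for illustration; they are not part of Parser or of any OCR library.

```python
from collections import Counter


def audit_question_numbers(ocr_readings):
    """Compare OCR readings against the expected 1, 2, 3, ... sequence.

    `ocr_readings` is a hypothetical list of strings, one per detected
    question-number region, in reading order.
    """
    parsed = []
    unreadable = []
    for index, text in enumerate(ocr_readings):
        # Tolerate surrounding whitespace and a trailing dot, e.g. "12."
        cleaned = text.strip().rstrip(".")
        if cleaned.isdecimal():
            parsed.append(int(cleaned))
        else:
            unreadable.append((index, text))

    counts = Counter(parsed)
    # A number read more than once would explain a question being highlighted twice.
    duplicates = sorted(n for n, c in counts.items() if c > 1)
    # Assumption: one region per question, numbered from 1 upwards.
    expected = set(range(1, len(ocr_readings) + 1))
    missing = sorted(expected - set(parsed))
    return {"duplicates": duplicates, "missing": missing, "unreadable": unreadable}


# Made-up readings: "2" is read twice and "5" is misread as "S".
readings = ["1", "2", "2", "4", "S", "6"]
report = audit_question_numbers(readings)
print(report)
```

If the detected regions look right by eye but the report still shows duplicates or unreadable entries, that points at the OCR step rather than at detection or reattachment.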

Yep, I am pretty sure the problem is with the OCR. The detection module is working well, but the OCR library is messing up. So, I am thinking of building an AI that does it for me, or perhaps digging out my old code and hacking together a simple one.

Update 2

This problem is pretty simple. The Keras documentation already has a tutorial for building a simple AI that classifies digits, which I can use to check whether my detected image has a number in it. That way, I don’t have to use OCR. It might slow down the program, but in the long run it should be much more reliable than the OCR library I am using right now.

There is a problem: we cannot use the MNIST dataset, which would have made our job a whole lot easier, because its digits range from 0 to 9 but we need the upper bound to be at least 100. I guess we can create a dataset ourselves, but that is going to take a lot of time.
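One possible shortcut for building such a dataset is to compose multi-digit numbers from single-digit images placed side by side. This is only a sketch under assumptions: `digit_bank`, `compose_number_image` and `build_dataset` are hypothetical names, and the bank is filled with random noise here so the code runs without downloading anything. In practice the bank would hold real digit crops, and since question papers are printed rather than handwritten, crops taken from actual papers would probably suit better than MNIST.

```python
import numpy as np


def compose_number_image(number, digit_bank, rng):
    """Build an image of `number` by placing digit images side by side.

    `digit_bank` maps each digit 0-9 to an array of shape (n, 28, 28).
    """
    tiles = []
    for char in str(number):
        samples = digit_bank[int(char)]
        # Pick one random sample of this digit.
        tiles.append(samples[rng.integers(len(samples))])
    return np.hstack(tiles)


def build_dataset(digit_bank, max_number=100, per_number=5, seed=0):
    """Return (images, labels) covering 1..max_number, padded to one width."""
    rng = np.random.default_rng(seed)
    width = 28 * len(str(max_number))  # room for the widest number
    images, labels = [], []
    for number in range(1, max_number + 1):
        for _ in range(per_number):
            img = compose_number_image(number, digit_bank, rng)
            # Left-align the number and pad the rest with zeros.
            padded = np.zeros((28, width), dtype=img.dtype)
            padded[:, : img.shape[1]] = img
            images.append(padded)
            labels.append(number)
    return np.stack(images), np.array(labels)


# Placeholder bank of random noise, standing in for real digit images.
placeholder_rng = np.random.default_rng(42)
digit_bank = {d: placeholder_rng.random((3, 28, 28)) for d in range(10)}
images, labels = build_dataset(digit_bank)
print(images.shape, labels.shape)
```

With the default arguments this yields five samples for each number from 1 to 100, all padded to the width of a three-digit number so they can be fed to a single classifier.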

Well, I do not see any easier solution than this. I might have to spend another day building the AI. I could not find any resource on the internet that could lead me to a suitable dataset. So, I am going to embark upon building my own, again.

Originally published on August 10, 2020.



