I do a lot of problems on Project Euler. It is a really cool site full of mathematically-flavoured programming problems that are too difficult for me. My homepage right now keeps some live stats about my progress on Project Euler and some similar sites. Unfortunately, Project Euler doesn't provide any sort of public machine-readable data on how many problems you have solved. However, it does provide an image badge like the following:
I want to use OCR (optical character recognition) to parse the number of problems I have solved out of the above image, and allow it to be displayed on my homepage.
I'm going to use Python3 for this, because Python is easy, and speed isn't critical for this application (it will probably be run as a daily cronjob or something).
There are some nice articles out there that suggest tesseract-ocr is my best bet for some digit recognition. pyocr is the most actively maintained wrapper for it that I can find as of early 2017. Installing this with pip is trivial, although I had to play with the Pillow installation first on my Ubuntu 14.04 VPS.
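Since the dependencies are the fiddly part, a quick sanity check can save some head-scratching. The following is only a rough sketch using the standard library: `check_dependencies` is a hypothetical helper name (it is not part of the repo), and it assumes the Tesseract binary is installed under the name `tesseract`.

```python
import importlib.util
import shutil


def check_dependencies():
    """Report which pieces of the OCR stack can be found on this machine."""
    return {
        # Python packages: find_spec returns None when a package is not importable
        "pyocr": importlib.util.find_spec("pyocr") is not None,
        "Pillow": importlib.util.find_spec("PIL") is not None,
        # The tesseract executable, assumed here to be on PATH as "tesseract"
        "tesseract": shutil.which("tesseract") is not None,
    }


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print("{}: {}".format(name, "found" if found else "MISSING"))
```

If anything shows up as missing, that is the piece to install before going any further.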
Let's get coding. You can see all the work in the euler-foiler GitHub repo (the repo name rhymes if you pronounce Euler correctly xo). There is not much to it. Here are the guts of the parse function:
    from PIL import Image
    import pyocr
    import pyocr.tesseract

    def parse(image_loc):
        tools = pyocr.get_available_tools()
        if len(tools) == 0:
            return "Error: No OCR tool found"
        # should be 'Tesseract (sh)'
        tool = tools[0]

        orig_image = Image.open(image_loc)
        # crop to only the section with the number of problems solved
        cropped_image = orig_image.crop((47, 40, 97, 60))
        # double the size of the image so the OCR has more to go on
        # (Image.LANCZOS is the same filter that older Pillow called ANTIALIAS)
        resized_image = cropped_image.resize((100, 40), Image.LANCZOS)

        digits = tool.image_to_string(
            resized_image, builder=pyocr.tesseract.DigitBuilder())
        return digits
Most of the code is taken directly from the README of the pyocr project. The script takes the first tool that pyocr finds, which should be Tesseract (sh), as it provides us with a builder that only recognises digits.
We use a nice Pillow method to crop the Project Euler badge to only give us the part we care about (the number of problems solved). The hard-coded parameters are the left, top, right and bottom of the bounding box, and were figured out in the GIMP.
If we plug this into tesseract straight away, then it will not give us the right answer, as it is looking at a very small image (50x20 pixels). We therefore use Pillow to double the size of the cropped image.
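The arithmetic behind those hard-coded numbers can be sketched in a few lines. `box_size` and `scaled_size` are hypothetical helper names used only for illustration; the script itself just passes the literal tuples to Pillow.

```python
def box_size(box):
    """Width and height of a (left, top, right, bottom) bounding box."""
    left, top, right, bottom = box
    return (right - left, bottom - top)


def scaled_size(box, factor=2):
    """Size of the box after scaling both dimensions by `factor`."""
    width, height = box_size(box)
    return (width * factor, height * factor)


CROP_BOX = (47, 40, 97, 60)   # the bounding box measured in the GIMP
print(box_size(CROP_BOX))     # (50, 20)
print(scaled_size(CROP_BOX))  # (100, 40)
```

This is where the (100, 40) passed to the resize call comes from: the 50x20 crop with each side doubled.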
We can then pop this into tesseract, specifying the digit-recognising model. We do not have to specify a language, as whatever the system default is will do for digits.
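What comes back from tesseract is a string, so before it goes anywhere near a homepage it is probably worth checking that it really is a number. Here is a minimal sketch of that check. `to_solved_count` is a hypothetical helper that is not in the repo, and the assumption that stray spaces between digits can simply be dropped is mine.

```python
import re


def to_solved_count(raw):
    """Convert raw OCR text into an int, or return None if it is not a plain number."""
    # Assumption: whitespace around or between digits is OCR noise and can be dropped
    text = re.sub(r"\s+", "", raw)
    if not re.fullmatch(r"[0-9]+", text):
        return None
    return int(text)


print(to_solved_count("128\n"))                      # 128
print(to_solved_count("Error: No OCR tool found"))   # None
```

Returning None for anything that is not purely digits also catches the error string that parse gives back when no OCR tool is found.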
We're done! That wasn't nearly as complicated as I expected. The most difficult part was installing the dependencies.
To use this script, all you need to do is grab your badge (e.g. wget https://projecteuler.net/profiles/iandioch.png), run the script with the right path (python3 parse.py iandioch.png), and you're away!
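If you would rather not shell out to wget (for a cronjob, say), the download step could be done from Python as well. This is only a sketch: `badge_url` and `fetch_badge` are hypothetical names, and the URL pattern is simply inferred from the badge address above.

```python
import urllib.request

# URL pattern inferred from the badge address used with wget above
BADGE_URL = "https://projecteuler.net/profiles/{username}.png"


def badge_url(username):
    """Build the badge URL for a Project Euler username."""
    return BADGE_URL.format(username=username)


def fetch_badge(username, dest=None):
    """Download the badge image and return the local path it was saved to."""
    dest = dest or username + ".png"
    urllib.request.urlretrieve(badge_url(username), dest)
    return dest
```

The path returned by `fetch_badge("iandioch")` can then be handed straight to the parse function.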
It accurately recognises my 128 problems solved right now. I look forward to it recognising higher numbers in future. I will work now to incorporate the data into my homepage, and now that I see OCR isn't so scary when I use nice libraries, I can apply it to a few other problems...