
Parsing Project Euler Progress with PyOCR

I do a lot of problems on Project Euler. It is a really cool site full of mathematically-flavoured programming problems that are too difficult for me. My homepage right now keeps some live stats about my progress on Project Euler and some similar sites. Unfortunately, Project Euler doesn't provide any sort of public machine-readable data on how many problems you have solved. However, it does provide an image badge like the following:

I want to use OCR (optical character recognition) to parse the number of problems I have solved out of the above image, and allow it to be displayed on my homepage.

I'm going to use Python 3 for this, because Python is easy, and speed isn't critical for this application (it will probably be run as a daily cron job or something).

The Setup

There are some nice articles out there that suggest tesseract-ocr is my best bet for some digit-recognition. pyocr is the most actively maintained wrapper for it that I can find as of early 2017. Installing this with pip is trivial, although I had to play with Pillow installation first on my Ubuntu 14.04 VPS.

The Heist

Let's get coding. You can see all the work in the euler-foiler GitHub repo (the repo name rhymes if you pronounce Euler correctly xo). There is not much to it. Here are the guts of the parse function in parse.py:

import pyocr
from PIL import Image


def parse(image_loc):
    tools = pyocr.get_available_tools()
    if len(tools) == 0:
        return "Error: No OCR tool found"

    # should be 'Tesseract (sh)'
    tool = tools[0]

    orig_image = Image.open(image_loc)

    # crop to only the section with the number of problems solved
    # box is (left, top, right, bottom), giving a 50x20 pixel region
    cropped_image = orig_image.crop((47, 40, 97, 60))

    # double the size of the image so the OCR has more to go on
    # (Image.LANCZOS is the same filter as the old Image.ANTIALIAS alias,
    # which was removed in Pillow 10)
    resized_image = cropped_image.resize((100, 40), Image.LANCZOS)

    # DigitBuilder tells tesseract to only look for digits
    digits = tool.image_to_string(
        resized_image,
        builder=pyocr.tesseract.DigitBuilder())

    return digits

Most of the code is taken directly from the README of the pyocr project. We take the first tool that pyocr finds, which should be Tesseract (sh), because pyocr's Tesseract module provides a DigitBuilder that restricts recognition to digits.

We use a nice Pillow method to crop the Project Euler badge to only give us the part we care about (the number of problems solved). The hard-coded parameters are the left, top, right and bottom of the bounding box, and were figured out in the GIMP.

If we plug this into tesseract straight away, then it will not give us the right answer, as it is looking at a very small image (50x20 pixels). We therefore use Pillow to double the size of the cropped image.
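The crop box and the doubled size are hard-coded in parse.py, but the arithmetic behind them is easy to check. Here is a minimal sketch; box_size and scaled_size are hypothetical helper names of my own, not functions from the repo or from Pillow:

```python
# Hypothetical helpers (not part of euler-foiler or Pillow) that derive
# the resize target from the crop box instead of hard-coding it.

def box_size(box):
    """Return (width, height) of a Pillow-style (left, top, right, bottom) box."""
    left, top, right, bottom = box
    return (right - left, bottom - top)


def scaled_size(box, factor=2):
    """Return the size of the box after scaling both sides by `factor`."""
    width, height = box_size(box)
    return (width * factor, height * factor)


BADGE_BOX = (47, 40, 97, 60)  # the crop box used in parse.py

print(box_size(BADGE_BOX))     # (50, 20)
print(scaled_size(BADGE_BOX))  # (100, 40)
```

Passing scaled_size(BADGE_BOX) to resize would keep the two numbers in sync if the crop box ever had to change.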

We can then pop this into tesseract, specifying the digit-recognising model. We do not have to specify a language, as whatever the system default is will do for digits.
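The parse function hands back whatever string the OCR tool produced, so a consumer such as a homepage would probably want to turn it into a number first. The following is only a sketch of one way to do that; digits_to_int is a hypothetical name, and I am assuming the output may contain stray whitespace or, on failure, non-digit text such as the error message:

```python
import re


def digits_to_int(ocr_text):
    """Convert raw OCR output such as ' 128\n' to an int.

    Returns None when the text is not purely digits once whitespace is
    removed, e.g. for an empty string or an error message.
    """
    # Remove all whitespace, including any spaces between digits.
    cleaned = "".join(ocr_text.split())
    if re.fullmatch(r"[0-9]+", cleaned):
        return int(cleaned)
    return None


print(digits_to_int(" 128\n"))                     # 128
print(digits_to_int("Error: No OCR tool found"))  # None
```

Returning None rather than raising keeps a daily job from crashing on one bad read; whether that is the right trade-off depends on how the number is used.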

The Getaway

We're done! That wasn't nearly as complicated as I expected. The most difficult part was installing the dependencies.

If you look at the full file, you will see some extra bits to make sure the path the user gives is valid, etc. There is also a very small test script in the repo.
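For a rough idea of what such a check might look like, here is a sketch. It is not the code from the repo; validate_image_path and its error messages are made up for illustration, and the .png check is my own assumption based on the badge format:

```python
import os


def validate_image_path(path):
    """Return an error string if `path` is unusable, otherwise None.

    Hypothetical helper, not the check used in euler-foiler.
    """
    if not os.path.isfile(path):
        return "Error: no such file: {}".format(path)
    # Assumption: the badge is always served as a PNG.
    if not path.lower().endswith(".png"):
        return "Error: expected a .png badge: {}".format(path)
    return None
```

A caller would run this before parse and print the error instead of handing a bad path to Pillow.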

To use this script, all you need to do is grab your badge (e.g. wget https://projecteuler.net/profiles/iandioch.png), run the script with the right path (python3 parse.py iandioch.png), and you're away!
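If you would rather fetch the badge from Python than from wget, the standard library can do it. This is a sketch under assumptions: badge_url and fetch_badge are hypothetical names, the URL pattern is simply the one from the wget example, and I have not checked whether the site accepts urllib's default user agent:

```python
import urllib.request

# URL pattern taken from the wget example; only the username changes.
BADGE_URL = "https://projecteuler.net/profiles/{username}.png"


def badge_url(username):
    """Build the badge URL for a Project Euler username."""
    return BADGE_URL.format(username=username)


def fetch_badge(username, dest=None):
    """Download the badge to `dest` (default '<username>.png') and return the path."""
    dest = dest or username + ".png"
    urllib.request.urlretrieve(badge_url(username), dest)
    return dest
```

Calling fetch_badge("iandioch") should leave iandioch.png in the current directory, ready to pass to parse.py.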

It accurately recognises my 128 problems solved right now. I look forward to it recognising higher numbers in future. I will now work to incorporate the data into my homepage, and now that I see OCR isn't so scary when I use nice libraries, I can apply it to a few other problems...
