Using Microsoft Cognitive Services to perform OCR on images

Binary Adventures

Microsoft Cognitive Services (née Project Oxford) is a set of REST APIs that let you interface with machine learning algorithms. The services are divided into the categories Vision, Speech, Language and Knowledge.

Here’s how Microsoft describes it:

Microsoft Cognitive Services (formerly Project Oxford) are a set of APIs, SDKs and services available to developers to make their applications more intelligent, engaging and discoverable. Microsoft Cognitive Services expands on Microsoft’s evolving portfolio of machine learning APIs and enables developers to easily add intelligent features – such as emotion and video detection; facial, speech and vision recognition; and speech and language understanding – into their applications. Our vision is for more personal computing experiences and enhanced productivity aided by systems that increasingly can see, hear, speak, understand and even begin to reason.

There are free trials available with a limited number of monthly API calls, which should be more than sufficient for occasional use. You’ll need to create an account in order to get a subscription key.

In this post, we’ll be using the Computer Vision API to extract text from images, commonly referred to as Optical Character Recognition (OCR).

Say that we’ve been sent a screenshot of some text, something we’ve all experienced at some point. Usually we settle for retyping the text captured in the image ourselves (with the necessary amount of grumbling). With the Vision API and the code below, we can take that screenshot, have it analyzed and receive the extracted text, ready to be used.

Retrieving the copied image data

The initial goal for the code was to be integrated into an Alfred workflow, where you’d have the image ready on the clipboard.

Retrieving anything but text from the (OS X) clipboard in Python seems to be a troublesome ordeal. The CLI command pbpaste and the Python libraries xerox and pyperclip only support text. Judging from posts on StackOverflow, it should be possible using PyObjC or Tkinter but I couldn’t get the former to install and the latter seems a bit much just to get access to the clipboard.
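
For the curious, the PyObjC route would presumably look something like the following. This is an untested sketch (I couldn’t get PyObjC installed, so treat the details as assumptions):

from AppKit import NSPasteboard, NSPasteboardTypePNG

pb = NSPasteboard.generalPasteboard()
data = pb.dataForType_(NSPasteboardTypePNG)  # None if there's no PNG on the clipboard
img_data = bytes(data) if data is not None else None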

In the end I opted for the CLI utility pngpaste, which aims to do for binary data what pbpaste does for text. By passing - as the parameter instead of a filename, pngpaste writes the image to stdout, from which the binary data can be read into Python.

Image dimensions

The API requires the image to be at least 40x40 pixels; if it’s smaller, the server returns an HTTP 500 error. We’ll use Pillow to enlarge the canvas (not resize the image itself) to meet the requirement.

There are other requirements and limitations. Have a look at the API documentation to get the details.
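
By way of an illustrative sketch, a pre-flight check could look like this. The 4 MB upload cap below is taken from the documentation at the time of writing; treat it as an assumption and verify the current limits:

def check_upload_size(img_data, max_bytes=4 * 1024 * 1024):
    # 4 MB cap per the API docs at the time of writing (verify before relying on it)
    if len(img_data) > max_bytes:
        raise ValueError('image exceeds the API upload limit')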

Code

First, we’ll retrieve the image from the clipboard using pngpaste:

p = subprocess.run('./pngpaste -',
                   shell=True,
                   check=True,
                   stdout=subprocess.PIPE,
                   stderr=subprocess.PIPE)
img_data = p.stdout
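
Since the command is a fixed string, invoking a shell isn’t strictly necessary. A minor variation (not part of the original workflow) passes the arguments as a list and drops shell=True:

p = subprocess.run(['./pngpaste', '-'],
                   check=True,
                   stdout=subprocess.PIPE,
                   stderr=subprocess.PIPE)
img_data = p.stdout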

Next, we’ll check the dimensions of the image to ensure both width and height are at least 40 pixels. Since we read the image as raw bytes from the subprocess’s stdout in the previous step, we need to wrap it in BytesIO before passing it to Image.open().

img = Image.open(BytesIO(img_data))
# Enlarge the canvas to at least 40x40; cropping past the image bounds pads the new area
if min(img.size) < 40:
    img = img.crop((0, 0, max(img.size[0], 40), max(img.size[1], 40)))

# Re-encode the (possibly padded) image as PNG
bin_img = BytesIO()
img.save(bin_img, format='PNG')
img.close()

img_data = bin_img.getvalue()
bin_img.close()
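
Note that when the crop box extends past the image bounds, Pillow pads the new area (black for RGB images, transparent when there’s an alpha channel). A quick illustration with hypothetical sizes:

from PIL import Image

small = Image.new('RGB', (25, 60), 'white')  # a 25x60 stand-in for a narrow screenshot
padded = small.crop((0, 0, 40, 60))          # canvas grows to 40x60; the new strip is black
print(padded.size)                           # (40, 60)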

We’re now ready to pass the data to the API. This requires an HTTP POST call, passing the image data along with the necessary headers (e.g. the subscription key).

r = requests.post(api_url,
                  params=params,
                  headers=header,
                  data=img_data)

r.raise_for_status()

The raise_for_status() call raises an exception if the server returned an HTTP error status (4xx or 5xx); on success it does nothing. When the call succeeds, the API returns a JSON object containing the extracted text.
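
If the call fails (such as the HTTP 500 for undersized images mentioned earlier), it can be useful to dump the response body before re-raising, since the API typically includes an error description there. A minimal sketch:

try:
    r.raise_for_status()
except requests.HTTPError:
    # the response body usually carries a JSON error code and message
    print(r.status_code, r.text)
    raise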

Here’s an example of the data that would be returned (based on a sample provided by Microsoft):

{
    "language": "en",
    "textAngle": 0,
    "orientation": "Up",
    "regions": [
        {
            "boundingBox": "791,118,592,838",
            "lines": [
                {
                    "boundingBox": "793,750,498,59",
                    "words": [
                        {
                            "boundingBox": "793,750,498,59",
                            "text": "LITERALLY"
                        }
                    ]
                },
                {
                    "boundingBox": "791,821,390,60",
                    "words": [
                        {
                            "boundingBox": "791,821,390,60",
                            "text": "ASTOUND"
                        }
                    ]
                },
                {
                    "boundingBox": "795,893,531,63",
                    "words": [
                        {
                            "boundingBox": "795,893,531,63",
                            "text": "OURSELVES."
                        }
                    ]
                }
            ]
        }
    ]
}

Let’s retrieve the words from the JSON data (we’re not interested in anything else):

data = r.json()

text = ''
for region in data['regions']:
    for line in region['lines']:
        words = ' '.join(word['text'] for word in line['words'])
        text += words + '\n'
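
Applied to the sample response above, this yields:

LITERALLY
ASTOUND
OURSELVES.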

Complete code

Remember that the original purpose of this code was to be used in an Alfred workflow. That’s why, for example, we print the parsed text: Alfred picks up whatever is sent to standard output.

import requests
import subprocess
import sys
from PIL import Image
from io import BytesIO

api_url = 'https://westus.api.cognitive.microsoft.com/vision/v1.0/ocr'

header = {'Ocp-Apim-Subscription-Key': 'dummy_subscription_key',
          'Content-Type': 'application/octet-stream'}

params = {'language': 'unk'}

try:
    # Retrieve the binary image data from the clipboard
    p = subprocess.run('./pngpaste -',
                       shell=True,
                       check=True,
                       stdout=subprocess.PIPE,
                       stderr=subprocess.PIPE)
    img_data = p.stdout

    img = Image.open(BytesIO(img_data))
    # Ensure the image is at least 40x40
    if min(img.size) < 40:
        img = img.crop((0, 0, max(img.size[0], 40), max(img.size[1], 40)))

    bin_img = BytesIO()
    img.save(bin_img, format='PNG')
    img.close()

    img_data = bin_img.getvalue()
    bin_img.close()

    r = requests.post(api_url,
                      params=params,
                      headers=header,
                      data=img_data)

    r.raise_for_status()

    data = r.json()

    text = ''
    for region in data['regions']:
        for line in region['lines']:
            words = ' '.join(word['text'] for word in line['words'])
            text += words + '\n'
    print(text)

except subprocess.CalledProcessError as e:
    print('Could not get image from clipboard: {}'.format(e), file=sys.stderr)

except requests.HTTPError as e:
    print('HTTP error occurred: {}'.format(e), file=sys.stderr)

except Exception as e:
    print('Error occurred: {}'.format(e), file=sys.stderr)