Google’s New Software Can Caption Images With Scary Accuracy

The system uses machine learning to identify what's happening in a photo and translate it into text.

Tired of lackluster Google Images search results? The tech giant might be on the way to fixing that.

Google has created a machine-learning system that can take an image and produce a caption accurately describing what is happening in it, a recent post on Google’s blog explained.

Take this picture of pizza, for instance:



Google’s new software produced the following caption to describe it: “Two pizzas sitting on top of a stove top oven.” It missed the glass of wine and failed to mention that the pizzas look sadly picked-over, but hey, it’s still pretty impressive.

The blog post notes that plenty of past research has explored computers’ ability to identify and label objects. “But accurately describing a complex scene requires a deeper representation of what’s going on in the scene,” the authors write, “capturing how the various objects relate to one another and translating it all into natural-sounding language.”

Google got the idea for the project from existing software that uses machine learning to translate text from one language into another. In those programs, “a Recurrent Neural Network (RNN) transforms, say, a French sentence into a vector representation, and a second RNN uses that vector representation to generate a target sentence in German,” the post explains. Here’s a description of the rest of their process, for all the computer geeks out there:

Now, what if we replaced that first RNN and its input words with a deep Convolutional Neural Network (CNN) trained to classify objects in images? Normally, the CNN’s last layer is used in a final Softmax among known classes of objects, assigning a probability that each object might be in the image. But if we remove that final layer, we can instead feed the CNN’s rich encoding of the image into a RNN designed to produce phrases. We can then train the whole system directly on images and their captions, so it maximizes the likelihood that descriptions it produces best match the training descriptions for each image.
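For readers who want to see the shape of that idea in code, here is a minimal sketch in plain Python. It is not Google's implementation. The vocabulary, the layer sizes, the random untrained weights, and every name in it (ToyCaptioner, caption_log_likelihood, and so on) are assumptions made up for illustration, and the list of numbers passed in as image_features stands in for the encoding a real CNN would produce once its final Softmax layer is removed.

```python
import math
import random

# Hypothetical toy vocabulary and sizes; a real system would use thousands of words.
VOCAB = ["<start>", "<end>", "two", "pizzas", "on", "a", "stove"]
FEATURE_DIM = 4  # length of the stand-in CNN image encoding
HIDDEN_DIM = 4   # size of the RNN's hidden state


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


class ToyCaptioner:
    """Untrained RNN decoder whose starting state comes from image features."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.w_image = random_matrix(HIDDEN_DIM, FEATURE_DIM, rng)
        self.w_hidden = random_matrix(HIDDEN_DIM, HIDDEN_DIM, rng)
        self.w_embed = random_matrix(len(VOCAB), HIDDEN_DIM, rng)
        self.w_out = random_matrix(len(VOCAB), HIDDEN_DIM, rng)

    def encode(self, image_features):
        # The image encoding replaces the "first RNN" of a translation system.
        return [math.tanh(v) for v in matvec(self.w_image, image_features)]

    def step(self, hidden, word_index):
        # One RNN step: mix the previous state with the previous word.
        recurrent = matvec(self.w_hidden, hidden)
        embedding = self.w_embed[word_index]
        new_hidden = [math.tanh(r + e) for r, e in zip(recurrent, embedding)]
        probs = softmax(matvec(self.w_out, new_hidden))
        return new_hidden, probs

    def caption(self, image_features, max_words=10):
        # Greedy decoding: always pick the most probable next word.
        hidden = self.encode(image_features)
        word = VOCAB.index("<start>")
        words = []
        for _ in range(max_words):
            hidden, probs = self.step(hidden, word)
            word = max(range(len(VOCAB)), key=probs.__getitem__)
            if VOCAB[word] == "<end>":
                break
            words.append(VOCAB[word])
        return words


def caption_log_likelihood(model, image_features, target_words):
    """Log-probability the model assigns to a reference caption.

    Training would adjust the weights to push this number up.
    """
    hidden = model.encode(image_features)
    previous = VOCAB.index("<start>")
    total = 0.0
    for target in list(target_words) + ["<end>"]:
        hidden, probs = model.step(hidden, previous)
        previous = VOCAB.index(target)
        total += math.log(probs[previous])
    return total


if __name__ == "__main__":
    model = ToyCaptioner(seed=0)
    features = [0.5, -0.2, 0.1, 0.9]  # made-up image encoding
    print(model.caption(features, max_words=5))
    print(caption_log_likelihood(model, features, ["two", "pizzas"]))
```

Because the weights here are random, the words it prints are gibberish. The point is the wiring: image features set the decoder's starting state, each step yields a probability for every word in the vocabulary, and training would raise the likelihood of the human-written caption for each image.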

That may all sound like confusing tech-speak, but the new technology could one day assist people in their everyday lives.

“This kind of system,” the post says, “could eventually help visually impaired people understand pictures, provide alternate text for images in parts of the world where mobile connections are slow, and make it easier for everyone to search on Google for images.”

In the meantime, you can check out this table showing some of the system’s other attempts to caption images:



[h/t the Wall Street Journal]