Google has made some huge strides in the world of artificial intelligence in recent years, helping computers read and understand dozens of world languages. But the company’s latest AI project shifts the focus from words to actions.
Google recently released Atomic Visual Actions (AVA), a new dataset that labels human activity at the atomic level, using videos from YouTube (which Google owns). The company’s software engineers examined over 57,000 movie clips to find examples of 80 different actions, such as “walking,” “kicking” or “hitting,” as performed by actors of different nationalities.
“It is not that we regard this data as perfect, just that it is better than working with the assortment of user generated content such as videos of animal tricks, DIY instructional videos, events such as children’s birthday parties and the like,” Google engineers wrote in an accompanying research paper. “We expect movies to contain a greater range of activities as befits the telling of different kinds of stories.”
The team first took a 15-minute segment from each video, and then partitioned each segment into 300 non-overlapping three-second “action shots.” Each localized clip was labeled with the action it showed and kept the context surrounding it, such as the environment and the actor’s facial expression.
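The arithmetic behind that partitioning is easy to check: 15 minutes is 900 seconds, and 900 divided by 3 gives 300 clips. The short Python sketch below illustrates the idea only; the function name `partition_segment` and its parameters are hypothetical and do not come from Google's tooling.

```python
def partition_segment(duration_s=15 * 60, clip_s=3):
    """Split a segment into non-overlapping clips.

    Hypothetical helper for illustration only. Returns a list of
    (start, end) offsets in seconds, measured from the segment start.
    """
    # Step by clip_s so that consecutive clips never overlap.
    return [
        (start, start + clip_s)
        for start in range(0, duration_s - clip_s + 1, clip_s)
    ]


clips = partition_segment()
print(len(clips))   # number of three-second clips in a 15-minute segment
print(clips[0])     # first clip offsets
print(clips[-1])    # last clip offsets
```

Under these assumptions, the first clip covers seconds 0 to 3 and the last covers seconds 897 to 900.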
Every activity was put into one of three groups: pose/movement, person-object interaction or person-person interaction. Google ended up labeling 96,000 different people and identifying 210,000 distinct examples of action.
By associating each activity with a specific person, AVA was able to analyze multiple actions at the same time and know who was doing each thing (e.g. person A was hitting while person B was kicking).
The dataset also included more complex actions that required two people, such as shaking hands. The algorithm was further able to recognize when one person was doing multiple things, such as hugging and kissing or singing while playing an instrument.
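One way to picture labels that are tied to a person rather than to a whole clip is sketched below in Python. This is not AVA's actual file format: the `ActionLabel` record, its field names, the clip and person identifiers, and the category assigned to each example action are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

# The three groups named in the article.
CATEGORIES = {
    "pose/movement",
    "person-object interaction",
    "person-person interaction",
}


@dataclass(frozen=True)
class ActionLabel:
    """Hypothetical record: one action performed by one person in one clip."""
    clip_id: str
    person_id: str
    action: str
    category: str


def actions_by_person(labels):
    """Group action names by (clip, person), so one person can hold several."""
    grouped = defaultdict(set)
    for label in labels:
        if label.category not in CATEGORIES:
            raise ValueError(f"unknown category: {label.category}")
        grouped[(label.clip_id, label.person_id)].add(label.action)
    return dict(grouped)


# Illustrative labels only; categories here are assumed, not taken from AVA.
labels = [
    ActionLabel("clip_001", "A", "hug", "person-person interaction"),
    ActionLabel("clip_001", "A", "kiss", "person-person interaction"),
    ActionLabel("clip_001", "B", "hug", "person-person interaction"),
    ActionLabel("clip_001", "C", "walk", "pose/movement"),
]

grouped = actions_by_person(labels)
print(grouped[("clip_001", "A")])  # person A carries two labels in the same clip
```

Keying the grouping on both clip and person is what lets a single clip hold several people, each with their own set of simultaneous actions.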
The engineering team wrote that AVA proves machines can have “understanding at the level of an individual agent’s actions.”
“These are essential steps toward imbuing computers with ‘social visual intelligence,’” their research paper concluded. “Understanding what humans are doing, what might they do next and what they are trying to achieve.”