MIT researchers have developed a technique that teaches AI to capture actions shared between video and audio. For example, their method can understand that the act of a baby crying in a video is related to the spoken word “crying” in a sound clip. It’s part of an effort to teach AI how to understand concepts that humans have no trouble learning, but that computers find hard to grasp.

“The prevalent learning paradigm, supervised learning, works well when you have datasets that are well described and complete,” AI expert Phil Winder told Lifewire in an email interview. “Unfortunately, datasets are rarely complete because the real world has a bad habit of presenting new situations.”

Smarter AI

Computers have difficulty making sense of everyday scenarios because they must crunch data rather than perceive sound and images the way humans do. When a machine “sees” a photo, it must encode that photo into data it can use to perform a task such as image classification. AI can get bogged down when inputs arrive in multiple formats, like videos, audio clips, and images.

“The main challenge here is, how can a machine align those different modalities? As humans, this is easy for us,” Alexander Liu, an MIT researcher and first author of a paper on the subject, said in a news release. “We see a car and then hear the sound of a car driving by, and we know these are the same thing. But for machine learning, it is not that straightforward.”

Liu’s team developed an AI technique that, they say, learns to represent data in a way that captures concepts shared between visual and audio data. Using this knowledge, their machine-learning model can identify where a specific action is taking place in a video and label it.

The new model takes raw data, such as videos and their corresponding text captions, and encodes them by extracting features, or observations about objects and actions in the video. It then maps those data points in a grid, known as an embedding space. The model clusters similar data together as single points in the grid; each of these data points, or vectors, is represented by an individual word. For instance, a video clip of a person juggling might be mapped to a vector labeled “juggling.”

The researchers designed the model so it can use only 1,000 words to label vectors. The model can decide which actions or concepts to encode into a single vector, but it is limited to those 1,000 vectors, choosing the words it thinks best represent the data.

“If there is a video about pigs, the model might assign the word ‘pig’ to one of the 1,000 vectors. Then, if the model hears someone saying the word ‘pig’ in an audio clip, it should still use the same vector to encode that,” Liu explained.
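To make that shared-vector idea concrete, here is a minimal sketch in PyTorch of how two modalities could be projected into one embedding space and snapped to the nearest of 1,000 shared codebook vectors. The encoder sizes, class names, and other details are illustrative assumptions, not the MIT team’s actual implementation.

```
# Minimal sketch of a shared codebook across modalities (assumed PyTorch setup).
# All names and dimensions here are illustrative, not the researchers' real code.
import torch
import torch.nn as nn

NUM_CODES = 1000   # the model may only use 1,000 shared vectors ("words")
EMBED_DIM = 256    # size of the shared embedding space (assumed)

class SharedCodebook(nn.Module):
    """Maps any modality's features to the nearest of NUM_CODES shared vectors."""
    def __init__(self):
        super().__init__()
        self.codes = nn.Embedding(NUM_CODES, EMBED_DIM)

    def forward(self, features):                            # features: (batch, EMBED_DIM)
        # Distance from each input feature to every codebook vector
        dists = torch.cdist(features, self.codes.weight)    # (batch, NUM_CODES)
        ids = dists.argmin(dim=-1)                          # index of the nearest "word"
        return self.codes(ids), ids

# Simple per-modality encoders projecting raw features into the shared space.
video_encoder = nn.Linear(2048, EMBED_DIM)   # e.g. pooled video-frame features (assumed size)
audio_encoder = nn.Linear(128, EMBED_DIM)    # e.g. audio spectrogram features (assumed size)
codebook = SharedCodebook()

video_feats = torch.randn(4, 2048)           # stand-ins for real video features
audio_feats = torch.randn(4, 128)            # stand-ins for real audio features

_, video_ids = codebook(video_encoder(video_feats))
_, audio_ids = codebook(audio_encoder(audio_feats))
print(video_ids, audio_ids)                  # codebook indices chosen for each clip
```

In the full system, the two encoders would be trained so that, for example, a video of a pig and the spoken word “pig” land on the same codebook entry; that shared index is what lets the model align and label actions across video and audio.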

Your Videos, Decoded

Better labeling systems like the one developed by MIT could help reduce bias in AI, Marian Beszedes, head of research and development at biometrics firm Innovatrics, told Lifewire in an email interview. Beszedes suggested the data industry can view AI systems from a manufacturing process perspective.

“The systems accept raw data as input (raw materials), preprocess it, ingest it, make decisions or predictions and output analytics (finished goods),” Beszedes said. “We call this process flow the ‘data factory,’ and like other manufacturing processes, it should be subject to quality controls. The data industry needs to treat AI bias as a quality problem.”

“From a consumer perspective, mislabeled data makes, for example, online searches for specific images or videos more difficult,” Beszedes added. “With correctly developed AI, you can do labeling automatically, much faster and more neutrally than with manual labeling.”

But the MIT model still has some limitations. For one, their research focused on data from two sources at a time, but in the real world, humans encounter many types of information simultaneously, Liu said. “And we know 1,000 words work on this kind of dataset, but we don’t know if it can be generalized to a real-world problem,” he added.

The MIT researchers say their new technique outperforms many similar models. If AI can be trained to understand videos, you may eventually be able to skip watching your friend’s vacation videos and get a computer-generated report instead.