WRIGHT-PATTERSON AIR FORCE BASE, Ohio – As part of an increased commitment to autonomy research, a team from the Air Force Research Laboratory here entered and won the Large-Scale Movie Description Challenge at the 2017 International Conference on Computer Vision in Venice, Italy, Oct. 22-29, 2017.
“International open competitions such as the LSMDC provide an objective assessment of the latest state-of-the-art in cutting edge Artificial Intelligence technology,” said Dr. Vincent Velten, Decision Science Branch Technical Advisor at AFRL’s Multi-Domain Sensing Autonomy Division.
The goal of the LSMDC was to automatically generate a simple one-sentence description of the actions or activities that occur in a 4- to 5-second video clip from a movie. Participants were given access to a training data set of clips and associated human-generated sentences and were required to provide an algorithm for independent human evaluation against a blind test set of movie clips.
The AFRL team, composed of Dr. Scott Clouse, senior research engineer at the Decision Science Branch; Oliver Nina, a PhD student from The Ohio State University and also a research intern on a Department of Defense Science, Mathematics and Research for Transformation, or SMART, Scholarship for Service fellowship at AFRL; and Nina’s advisor, Dr. Alper Yilmaz, also from OSU, was victorious over world leaders in Artificial Intelligence research such as Facebook AI Research, the University of Toronto, and École Polytechnique de Montréal.
“This result prominently places the AFRL team in the AI research field and demonstrates an advanced technology that is a key enabling component of Air Force autonomy goals,” said Velten. “This technique can eventually be used to automate the screening of video streams to alert operators to operationally important events for systems such as Predator/Reaper and Global Hawk,” he said.
“This year, humans, in the form of a three-judge panel, evaluated the submitted algorithms rather than computers as in previous years,” said Nina, who has been conducting summer research in support of the project at AFRL’s Autonomy Technology Research Center.
For people who are hearing or visually impaired, enjoying a commercial film sometimes requires additional support beyond the traditional format, so they may be provided some form of accessibility to that media. One means of doing that is audio descriptive services, which provide a sort of audiobook version of the film to go along with it so people can enjoy it, Clouse explained.
The goal of the LSMDC challenge was to produce a system that can turn such a film into this audio description format.
“Currently, they’re produced in kind of a theatrical way, just like the film is, where you have writers converting a script or a screenplay into more of a prose format,” said Clouse. “The reader then has to be skilled enough to convey the information in a more theatrical type of way. Then the movie dialogue plays along as it would normally, so they kind of have to interject as they go with the film. In addition to the dialogue, you want to describe what’s going on. That’s the point of descriptive services for people who, in particular, are visually impaired.”
The point is to generate these services in an automated way to cut down on the cost of generating the capability, Clouse explained.
“Because of the cost and time required to produce these kinds of descriptions, it is not easy to access them for a lot of different films and television shows,” said Clouse. “There are a very limited number of these available. It’s a very human intensive process to generate these materials so it’s the fundamental limitation of the throughput of people. If you can generate them automatically, then you’ve got a nice description as well as the audio that goes along with the film.”
The Air Force would like to similarly produce descriptions of video sequences that are captured from surveillance platforms or any kind of data feed, according to Clouse.
“Video is very popular with the sharp end of the Air Force because people very naturally deal with watching video and understanding what’s going on there,” added Velten. “However, there is a lot of it and not that many people to do the equivalent of this sort of function for a military application.”
Analysts may have to watch 50 hours of video to find 5 minutes of something interesting that’s militarily pertinent and matters for intelligence purposes or even for doing a special operations mission rehearsal, Velten said.
“This sort of technology would allow us to index clips, like you would in a library, and just show us the interesting parts. The nice thing is that there’s a civilian analogue to it and so there is a lot of great civilian research and the AFRL team showed they are at the forefront of that. The real motivation for the military is to be able to sort through enormous amounts of video and describe actions that are going on,” said Velten.
The team worked with 101,000 short video clips that were provided for this year’s competition. The algorithm they developed takes the video clips and produces a sort of abstract summary that is then translated into human-readable phrases.
“A great deal of computer crunching was required to do this and the team was able to use the supercomputer called Thunder at AFRL’s Department of Defense Supercomputing Resource Center,” Velten said. Thunder is part of the DOD High Performance Computing Modernization Program. “All the computation needed to develop the algorithm was done on Thunder, and without it, this research simply would not have been possible,” he said.
“We’re trying to mimic what the brain is doing,” said Nina. “We can help the blind, or the visually impaired. Together we can reach great goals to help humanity and to help the Air Force defend our country,” he added.
Clouse indicated that many significant improvements remain to be made, but with all the data and computing power now available, there is a path to making leaps and bounds in many different facets of life.
“The whole point is to produce systems that can have more human-like qualities in terms of their ability to not only produce output from fairly limited input, but also to produce output that human beings can trust. This is a very difficult problem,” said Velten.
“Obviously, this has enormous defense applications, but even larger societal and commercial applications. There are some potentially very impressive things right around the corner,” Velten added.