School of Interactive Computing Ph.D. student Abhishek Das remembers the moment his interests in computer vision and language began to come into focus. Early in his doctoral studies, he came across an algorithm that could generate a one-line natural language description of an image with impressive accuracy. When he saw the results, it seemed almost magical, he said.
“I was blown away because you could give it any image, and it would generate a fairly plausible sentence,” he said. “I had never seen that before.”
Six months later, papers were being published on visual question answering, where an algorithm could not only generate a sentence but also answer questions about an image. He was similarly floored by the results.
He was advised by Dhruv Batra and also worked closely with Devi Parikh, both assistant professors at Virginia Tech at the time. When they joined Georgia Tech, Das brought his thirst for research in that space to Atlanta. Now, nearly two years later, he has published a number of papers on projects ranging from visual dialogue to a task called “embodied question answering.” He is pursuing additional research involving multiple agents, and sees a world, not far off, in which all of this simulated research helps develop hardware for assistive technology like in-home robots.
'It feels within reach...'
It’s a future that has been featured in popular culture for years – think about Rosie, the robot maid who first appeared on The Jetsons in 1962 – but is one that Das is beginning to see on the horizon.
“It feels within reach, the vision that we see in science fiction,” he said. “Movies of robots that you can talk to or give instructions to.”
While people outside the research sphere may see only the cold steel exterior of these imagined robots, building a viable foundation for them requires many different elements. These include computer vision, the analysis of visual information by a machine, and language, written or verbal communication and instruction. Das works at the intersection of both domains.
Broadly, his research has been in developing algorithms and intelligent agents that can see, talk, and ultimately act on that understanding in physical environments, taking actions like navigation or executing instructions.
Findings from a recent research project were published and presented at the 2018 Computer Vision and Pattern Recognition (CVPR) conference in Salt Lake City, Utah. The work explored a task called embodied question answering, in which an agent is asked a question and must find the answer by navigating and exploring its environment.
“It combines these three modalities: computer vision, language understanding, and reinforcement learning to take actions in this environment,” Das said.
The application here could be an assistive robot that takes a question or a command – “Where are my keys?” for example – and provides an answer or performs a task based on its understanding of the environment. He is also conducting similar work with multiple agents that could coordinate to perform certain tasks.
“I’m not currently working with the hardware side of things,” he said. “All of this is simulation, but these are the end goals. The vision is that these will make it to robots with these sorts of capabilities. And, more importantly, the algorithms that I’m building will hopefully generalize and be useful for a wide variety of tasks.”
A culture of collaboration
Das’ work has received extensive media attention, and he has had the opportunity to work under some prestigious grants and fellowships. Currently, he is supported by fellowships from Facebook, Adobe, and Snap; he was recently awarded fellowships from Facebook, Microsoft Research, and NVIDIA, accepting the Facebook award and declining the other two.
One of the great benefits, he said, of working at Georgia Tech in this space has been the opportunity to collaborate with individuals who are conducting research in complementary domains.
“On my floor in the College of Computing, there are people who are experts in computer vision, natural language processing, reinforcement learning, in robotics, or other areas, and it’s always awesome to bounce ideas off of them,” he said.
“Just this semester, I was taking (Associate Professor) Sonia Chernova’s course in human-robot interaction, and we prototyped a version of a tabletop embodied robot that could actually implement a very primitive version of the embodied question answering algorithm. That was a very interesting experience.”
Das is gaining valuable new experience this semester, as well. Having interned three times at Facebook AI Research, he is now in London interning with DeepMind, where he will work in areas related to this general space of agents that can see, talk, and act.