What does a platypus look like? Generating customized prompts for zero-shot image classification
Sarah Pratt, Ian Covert, Rosanne Liu, Ali Farhadi
This work combines open-vocabulary models with large language models (LLMs) to create Customized Prompts via Language models (CuPL, pronounced "couple"). In particular, we leverage the knowledge contained in LLMs to generate many descriptive sentences, which are then used to perform zero-shot image classification with open-vocabulary models. We find that this straightforward and general approach improves accuracy on a range of zero-shot image classification benchmarks, including a gain of over one percentage point on ImageNet. Finally, this simple baseline requires no additional training and remains completely zero-shot.
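The classification step can be illustrated with a minimal sketch. It assumes a shared text/image embedding space like CLIP's. Each class is represented by the renormalized mean of the embeddings of its generated descriptions, and an image is assigned to the class with the highest cosine similarity. The `embed` function is a hypothetical toy bag-of-words encoder standing in for a real open-vocabulary model. The descriptions are hand-written placeholders for LLM output, not prompts from the paper.

```python
import math
from collections import defaultdict

def normalize(vec: dict[str, float]) -> dict[str, float]:
    # Scale a sparse vector to unit L2 norm (left unchanged if it is all zeros).
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else dict(vec)

# Hypothetical toy encoder: an L2-normalized bag-of-words vector.
# It stands in for the text/image encoders of a real open-vocabulary
# model such as CLIP and is NOT part of CuPL.
def embed(text: str) -> dict[str, float]:
    counts: dict[str, float] = defaultdict(float)
    for word in text.lower().replace(",", " ").replace(".", " ").split():
        counts[word] += 1.0
    return normalize(counts)

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    # Inputs are unit vectors, so their dot product is the cosine similarity.
    return sum(v * b.get(k, 0.0) for k, v in a.items())

def class_embedding(descriptions: list[str]) -> dict[str, float]:
    # Average the embeddings of all descriptions for one class, then renormalize.
    total: dict[str, float] = defaultdict(float)
    for d in descriptions:
        for k, v in embed(d).items():
            total[k] += v / len(descriptions)
    return normalize(total)

def classify(image_emb: dict[str, float], class_prompts: dict[str, list[str]]) -> str:
    # Pick the class whose averaged description embedding is most similar to the image.
    class_embs = {c: class_embedding(ds) for c, ds in class_prompts.items()}
    return max(class_embs, key=lambda c: cosine(image_emb, class_embs[c]))

# Hand-written placeholders standing in for LLM-generated descriptions.
class_prompts = {
    "platypus": [
        "A platypus is a furry brown mammal with a duck-like bill.",
        "A platypus has webbed feet and a flat tail.",
    ],
    "goldfish": [
        "A goldfish is a small orange fish with shiny scales.",
        "A goldfish swims in a bowl of water.",
    ],
}

# The toy encoder only reads text, so the "image" is simulated by a caption.
image = embed("a brown furry animal with a flat bill and webbed feet")
print(classify(image, class_prompts))  # platypus
```

In a real setup, `embed` would be replaced by the model's text and image encoders. The rest of the pipeline would stay the same: pool each class's descriptions, then compare.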
The Introspective Agent: Interdependence of Strategy, Physiology, and Sensing for Embodied Agents
2022    [ arxiv ]    [ code ]
Sarah Pratt, Luca Weihs, Ali Farhadi
While traditional embodied agents manipulate an environment to best achieve a goal, we argue for an introspective agent, which considers its own abilities in the context of its environment. We show that different environments yield vastly different optimal designs, and increasing long-term planning is often far less beneficial than other improvements, such as increased physical ability.
Grounded Situation Recognition
Spotlight at ECCV 2020    [ arxiv ]    [ code ]    [ demo ]
Sarah Pratt, Mark Yatskar, Luca Weihs, Ali Farhadi, Ani Kembhavi
Situation recognition is the task of recognizing the activity happening in an image, the actors and objects involved in this activity, and the roles they play. Semantic roles describe how objects in the image participate in the activity described by the verb. While situation recognition addresses what is happening in an image, who is involved, and what roles they play, it does not address a critical aspect of visual understanding: where the involved entities lie in the image. We address this shortcoming and present Grounded Situation Recognition (GSR), a task that builds upon situation recognition and requires not only identifying the situation observed in the image but also visually grounding the identified roles within it.
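As a rough illustration of what a GSR prediction contains, the sketch below represents one grounded situation as a verb plus a mapping from each semantic role to a noun and an optional bounding box. The following are all assumptions made for this sketch, not the format of the released code or data:

- the field names;
- the `(x1, y1, x2, y2)` box convention;
- the example verb and roles;
- the rule that a role may be left ungrounded;
- the hypothetical `role_correct` check (noun match plus an IoU threshold of 0.5).

```python
from dataclasses import dataclass, field
from typing import Optional

# Assumed box convention for this sketch: (x1, y1, x2, y2) in pixels.
Box = tuple[float, float, float, float]

@dataclass
class RoleFiller:
    noun: str                  # entity filling the role, e.g. "man"
    box: Optional[Box] = None  # grounding in the image; None if not localized

@dataclass
class GroundedSituation:
    verb: str
    roles: dict[str, RoleFiller] = field(default_factory=dict)

    def grounded_roles(self) -> list[str]:
        """Names of roles whose filler has a bounding box."""
        return [name for name, f in self.roles.items() if f.box is not None]

def iou(a: Box, b: Box) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def role_correct(pred: RoleFiller, gold: RoleFiller, thresh: float = 0.5) -> bool:
    """Hypothetical criterion: the noun must match, and the grounding must
    agree (both absent, or both present with IoU >= thresh)."""
    if pred.noun != gold.noun:
        return False
    if pred.box is None or gold.box is None:
        return pred.box is None and gold.box is None
    return iou(pred.box, gold.box) >= thresh

# Illustrative example; the verb, roles, nouns, and boxes are made up.
gold = GroundedSituation("carrying", {
    "agent": RoleFiller("man", (10, 20, 110, 220)),
    "item": RoleFiller("box", (60, 80, 140, 160)),
    "place": RoleFiller("street"),  # scene-level role left ungrounded
})
pred = GroundedSituation("carrying", {
    "agent": RoleFiller("man", (12, 22, 108, 215)),
    "item": RoleFiller("box", (150, 80, 230, 160)),  # right noun, wrong location
    "place": RoleFiller("street"),
})

for role in gold.roles:
    print(role, role_correct(pred.roles[role], gold.roles[role]))
# agent True / item False / place True
```

The `item` role shows the gap GSR targets: a plain situation recognition check would accept the correct noun, but the grounding check rejects it because the box does not overlap the reference.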