Modeling Dynamic Social Vision Highlights Gaps Between Deep Learning and Humans
We present a dataset of natural videos and captions depicting complex multi-agent interactions, and we benchmark 350+ image, video, and language models against human behavioral and neural responses to the videos. Together, the results identify a major gap in the ability of current AI models to match human brain and behavioral responses, and they highlight the importance of studying vision in dynamic, natural contexts.