Modern vision-based language models face challenges with complex physical interactions, causal reasoning and intuitive psychology. Schulze Buschoff and colleagues demonstrate that while some models exhibit proficient visual data processing capabilities, they fall short of human performance in these cognitive domains.
- Luca M. Schulze Buschoff
- Elif Akata
- Eric Schulz