
https://arxiv.org/abs/2408.00754

Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model

Multimodal language models (MLLMs) are increasingly being applied in real-world environments, necessitating their ability to interpret 3D spaces and comprehend temporal dynamics. Current methods often rely on specialized architectural designs or task-speci...

Existing multimodal models ..