Spatially aware VLMs; VLM-based GUI web agents; a math video QA benchmark; a long-video counting benchmark
RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics
2025-06-04 | BUAA, PKU, BAAI | 32
http://arxiv.org/abs/2506.04308v1
https://huggingface.co/papers/2506.04308
https://zhoues.github.io/RoboRefer

Research Background and Significance

Problem Definition and Current State

Spatial referring is a foundational capability for robots to understand and interact with the 3D physical world. Although existing pretrained vision-language models (VLMs) perform well on 2D visual tasks, in complex 3D scenes they...