VL-Fields

Abstract

We present Visual-Language Fields (VL-Fields), a neural implicit spatial representation that enables open-vocabulary semantic queries. Our model encodes the geometry of a scene and fuses it with vision-language latent features by distilling information from a language-driven segmentation model. VL-Fields is trained without any prior knowledge of the object classes in the scene, which makes it a promising representation for robotics. On the task of semantic segmentation, our model outperforms the similar CLIP-Fields model by almost 10%.
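To make the idea of an open-vocabulary query concrete, here is a minimal sketch. It assumes that the feature stored for a scene point lives in the same embedding space as the text embeddings of the query prompts, so a point can be labelled by cosine similarity. The function names and the toy 3-D vectors are hypothetical; real features would come from the trained field and from a text encoder.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def classify_point(point_feature, text_embeddings):
    """Return the prompt most similar to the point feature, plus all scores."""
    scores = {label: cosine(point_feature, emb)
              for label, emb in text_embeddings.items()}
    return max(scores, key=scores.get), scores

# Toy stand-ins (assumption): one embedding per free-form text prompt.
text_embeddings = {
    "chair": [1.0, 0.0, 0.0],
    "table": [0.0, 1.0, 0.0],
    "other": [0.0, 0.0, 1.0],
}
# Toy stand-in (assumption): the feature the field predicts for one 3-D point.
point_feature = [0.9, 0.1, 0.2]

label, scores = classify_point(point_feature, text_embeddings)
print(label)
```

Because the prompts are only compared at query time, the set of labels can change without retraining, which is what makes the query open-vocabulary.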

Encoding the scene photometry and geometry

VL-Fields jointly encodes the geometry and appearance of a scene, along with the visual-language features. This allows us to rely only on the neural field for re-rendering the input video, without needing a stored point cloud (as CLIP-Fields does).
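One way to picture re-rendering from a field that stores geometry, appearance, and features together is NeRF-style volume rendering, where a single set of per-sample weights composites both colours and features along a camera ray. The sketch below assumes that rendering scheme, which the text above does not confirm; the function names and sample values are hypothetical.

```python
import math

def render_weights(densities, deltas):
    """Compositing weights w_i = T_i * (1 - exp(-sigma_i * delta_i)),
    where T_i is the transmittance accumulated before sample i."""
    weights, transmittance = [], 1.0
    for sigma, delta in zip(densities, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights

def composite(weights, values):
    """Weighted sum of per-sample vectors (colours or features)."""
    dim = len(values[0])
    return [sum(w * v[k] for w, v in zip(weights, values)) for k in range(dim)]

# Toy samples along one ray (assumption): empty space, a surface, then more space.
densities = [0.0, 50.0, 1.0]
deltas = [0.1, 0.1, 0.1]
colors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

weights = render_weights(densities, deltas)
pixel_rgb = composite(weights, colors)        # rendered colour for the pixel
pixel_feature = composite(weights, features)  # rendered feature for the pixel
```

Here the dense second sample dominates both outputs, so the pixel takes its colour and its feature from the same surface point. Sharing the weights is what ties the features to the scene geometry in this sketch.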

Related Work

There is a lot of excellent work on grounding language in neural implicit representations.

DFF introduced the idea of distilling knowledge from large vision-language models to ground language in neural fields.

CLIP-Fields demonstrated how such models can be used in mobile robotics to command robots with natural-language queries.

More recently, LERF addressed the limitations of relying on fine-tuned VL models (e.g., LSeg) by extracting vision-language features directly from CLIP.

BibTeX

@article{tsagkas2023vlfields,
  title   = {VL-Fields: Towards Language-Grounded Neural Implicit Spatial Representations},
  author  = {Tsagkas, Nikolaos and Mac Aodha, Oisin and Lu, Chris Xiaoxuan},
  journal = {arXiv preprint arXiv:2305.12427},
  year    = {2023}
}
