Interesting! Then what are your thoughts on papers such as the following, which do ground large language models through interaction with the world?
I'm specifically curious about your thoughts on Zero-shot Common Sense and LM-Nav, since I don't think either is really grounded in the way you want, but both at least appear to involve a "truly intelligent robot moving around and performing [in this case, simple] tasks".
1. Extracting Zero-shot Common Sense from Large Language Models for Robot 3D Scene Understanding:
https://arxiv.org/abs/2206.04585
2. LM-Nav:
https://sites.google.com/view/lmnav
(Optional:)
3. Generally capable agents emerge from open-ended play:
https://www.deepmind.com/blog/generally-capable-agents-emerge-from-open-ended-play
4. Building interactive agents in video game worlds:
https://www.deepmind.com/blog/building-interactive-agents-in-video-game-worlds