ArK: Augmented Reality with Knowledge Interactive Emergent Ability

*Equal Contribution
1 Microsoft Research, Redmond 2 University of Washington 3 MILA 4 UCLA

Observations about the emergent ability. The pipeline shows the emerging cross-modality capabilities of large foundation models in a requested unseen environment, and applies them automatically in reality-agnostic scenarios under a new paradigm. We present an AI-driven demonstration of a system that enables interactive generation and editing of a gaming/AR environment using knowledge-enhanced style projection.


Despite the growing adoption of mixed reality and interactive AI agents, it remains challenging for these systems to generate high-quality 2D/3D scenes in unseen environments. The common practice requires deploying an AI agent to collect large amounts of training data for every new task. This process is costly, or even impossible, in many domains. In this study, we develop an interactive agent that learns to transfer knowledge-memory from general foundation models (e.g., GPT-4, DALL-E) to novel domains or scenarios for scene understanding and generation in the physical or virtual world. The heart of our approach is an emerging mechanism, dubbed Augmented Reality with Knowledge Inference Interaction (ArK), which leverages knowledge-memory to generate scenes in unseen physical-world and virtual-reality environments. The knowledge interactive emergent ability (Figure 1) is observed at two levels: i) cross-modality micro-actions, in which multi-modality models collect a large amount of relevant knowledge-memory data from physical reality for each interaction task (e.g., unseen scene understanding); and ii) reality-agnostic macro-behavior, in which interactions in mixed-reality environments are improved by tailoring them to different characterized roles, target variables, collaborative information, and so on. We validate the effectiveness of ArK on scene generation and editing tasks. We show that ArK, combined with large foundation models, significantly improves the quality of generated 2D/3D scenes compared to baselines, demonstrating the potential benefit of incorporating ArK into generative AI for applications such as the metaverse and gaming simulation, where the emergent ability works invisibly.

ARK Design

Training ARK

  • Pre-train an infinite-memory agent to retrieve relevant knowledge K for a given image-text pair.
  • Use the VQA/WIT/COCO datasets to further train the agent to provide question-answer (QA) pairs for the image and text using the retrieved knowledge K.
  • The QA pairs are provided to a large language model (GPT-3.5), which generates a new prompt for DALL-E.
  • Apply reinforcement learning, with the similarity between the DALL-E generated image and the original image as the reward, to train the agent to ask relevant questions and give relevant answers.
  • At test time, use the 2D image generated by DALL-E to provide a simulation for the 3D/gaming scene generation model (ChatGPT / GPT-4), whose spatial arrangement and program synthesis output is run in a 3D engine.
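
The reward signal in the reinforcement learning step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the page only states that the reward is a similarity between the generated and original images, so the choice of cosine similarity over feature vectors is an assumption, and every name here (`similarity_reward`, `write_prompt`, `generate_features`, and the toy stand-ins) is hypothetical. The calls to the language model and to DALL-E are replaced by plain callables so the sketch runs without any external service.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
QAPair = Tuple[str, str]  # (question, answer)


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero feature vector as "no similarity"
    return dot / (norm_a * norm_b)


def similarity_reward(
    original_features: Vector,
    qa_pairs: List[QAPair],
    write_prompt: Callable[[List[QAPair]], str],
    generate_features: Callable[[str], Vector],
) -> float:
    """Reward for one set of QA pairs produced by the agent.

    write_prompt stands in for the language model that turns QA pairs
    into an image prompt; generate_features stands in for image
    generation followed by feature extraction. Both are assumptions.
    """
    prompt = write_prompt(qa_pairs)
    generated_features = generate_features(prompt)
    return cosine_similarity(original_features, generated_features)


# Toy stand-ins so the sketch is runnable end to end (hypothetical).
def toy_write_prompt(qa_pairs: List[QAPair]) -> str:
    return " ".join(answer for _, answer in qa_pairs)


def toy_generate_features(prompt: str) -> List[float]:
    words = prompt.lower().split()
    return [float(words.count("sofa")), float(words.count("lamp"))]


reward = similarity_reward(
    [1.0, 1.0],
    [("What is in the room?", "a sofa"), ("What lights it?", "a lamp")],
    toy_write_prompt,
    toy_generate_features,
)
```

In a real training loop this scalar would feed a policy-gradient update of the question-answering agent; that update is omitted here because the page does not specify which algorithm is used.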

Flowchart of ARK: Reinforcement Learning that Connects the DALL-E Prior and GPT-4/ChatGPT Program Synthesis for 3D Scene Generation.

From 2D to 3D: Cross-Modality Scene Generation

ARK combines entity knowledge from the web with knowledge of the 2D world to enhance 3D scene generation.
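
One way to picture this combination is as prompt assembly: retrieved entity knowledge and a description of the 2D image are merged into a single request for the program-synthesis model. The sketch below is only an illustration under that assumption. The function name `build_scene_prompt`, its parameters, and the prompt wording are all hypothetical and are not taken from the ARK system.

```python
from typing import Sequence


def build_scene_prompt(
    scene_request: str,
    knowledge: Sequence[str],
    image_description: str,
) -> str:
    """Merge web knowledge and 2D image information into one text prompt.

    All fields and the wording of the prompt are illustrative assumptions.
    """
    lines = [f"Scene request: {scene_request}", "Retrieved knowledge:"]
    # One bullet per retrieved knowledge snippet.
    lines.extend(f"- {fact}" for fact in knowledge)
    lines.append(f"2D reference image: {image_description}")
    lines.append("Write a program that places the objects above in a 3D scene.")
    return "\n".join(lines)


prompt = build_scene_prompt(
    "a small reading room",
    ["A reading lamp usually stands next to a chair."],
    "an armchair beside a floor lamp and a bookshelf",
)
```

The resulting string would then be sent to the language model, whose generated program is run in the 3D engine.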


Humans prefer scenes generated by our ARK module to those generated by vanilla OpenAI models.

Qualitative Results


  author    = {Qiuyuan Huang and Jae Sung Park and Abhinav Gupta and Paul Bennett and Ran Gong and Subhojit Som and Baolin Peng and Owais Khan Mohammed and Chris Pal and Yejin Choi and Jianfeng Gao},
  title     = {ArK: Augmented Reality with Knowledge Interactive Emergent Ability},
  journal   = {arXiv preprint },
  year      = {2023},