Scene-aware Activity Program Generation with Language Guidance

¹Shenzhen University   ²Vivo   ³Tencent AI Lab   ⁴Carleton University
SIGGRAPH Asia 2023
*Indicates Corresponding Author

Fig. 1. Our method generates scene-aware activity programs that are highly rational and executable. Here, we show three programs generated by our method, where a virtual avatar is instructed to perform various human activities in different scenes according to the input descriptions.

Abstract

We address the problem of scene-aware activity program generation, which requires decomposing a given activity task into instructions that can be sequentially performed within a target scene to complete the activity. While existing methods can generate programs that are either rational or executable, generating programs with both high rationality and executability remains a challenge. Hence, we propose a novel method whose key idea is to explicitly combine the language rationality of a powerful language model with dynamic perception of the target scene in which instructions are executed, to generate programs with both high rationality and executability. Our method generates the instructions of an activity program iteratively. Specifically, a two-branch feature encoder operates on language-based and graph-based representations of the current generation progress to extract language features and scene graph features, respectively. These features are then used by a predictor to generate the next instruction in the program. Subsequently, another module executes the predicted instruction and updates the scene for perception in the next iteration. Extensive evaluations on the VirtualHome-Env dataset show the advantages of our method over previous work. Key algorithmic designs are validated through ablation studies, and results on other types of inputs further demonstrate the generalizability of our method.
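For illustration, the following minimal Python sketch mirrors the iterative loop described above. Every name in it (Instruction, encode_language, encode_graph, predict_instruction, execute_and_update) is a hypothetical stub standing in for the learned modules, not the actual implementation.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Instruction:
    action: str               # e.g. "walk", "grab", "switchon"
    target: Optional[str]     # e.g. "fridge"; None for the stop token

def encode_language(description, program):
    # Stub: a pre-trained language model encodes the activity description
    # together with the text of the partial program.
    return (description, tuple(i.action for i in program))

def encode_graph(scene_graph, program):
    # Stub: a graph encoder extracts per-node features from the current
    # scene state.
    return tuple(sorted(scene_graph))

def predict_instruction(lang_feat, graph_feat):
    # Stub: the real predictor fuses both features to score candidate
    # actions and objects; this placeholder stops immediately.
    return Instruction("STOP", None)

def execute_and_update(scene_graph, instruction):
    # Stub: executing an instruction modifies the scene topology and
    # object properties seen by the next iteration's perception.
    return scene_graph

def generate_program(description, scene_graph, max_steps=30):
    """Iteratively grow the program until a stop instruction is predicted."""
    program = []
    for _ in range(max_steps):
        lang_feat = encode_language(description, program)    # language branch
        graph_feat = encode_graph(scene_graph, program)      # graph branch
        instruction = predict_instruction(lang_feat, graph_feat)
        if instruction.action == "STOP":
            break
        program.append(instruction)
        scene_graph = execute_and_update(scene_graph, instruction)
    return program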


Fig. 2. Overview of our activity program generation method, which takes as input (a) an activity description and a scene with a human agent, indicated by the yellow H node. (b) At each iteration, the two-branch feature encoding module first encodes the current state information, including the scene state, partial program, and description of the desired activity, to guide the instruction generation module in predicting the next action and its target object. The instruction execution and scene update module then updates the scene topology and properties based on the execution of the predicted instruction. The method iterates until (c) the full program and its corresponding motion sequence are generated.
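The scene in (a) is represented as a graph over objects and the agent. The snippet below illustrates one plausible schema, loosely following the VirtualHome-style node/edge representation; the exact fields are assumptions for illustration only.

# The agent appears as an ordinary node (the yellow "H" node in Fig. 2).
scene_graph = {
    "nodes": {
        1: {"class": "character", "states": []},             # the human agent
        2: {"class": "kitchen",   "states": []},
        3: {"class": "fridge",    "states": ["CLOSED", "OFF"]},
        4: {"class": "milk",      "states": ["GRABBABLE"]},
    },
    "edges": [
        (1, "INSIDE", 2),   # the agent is in the kitchen
        (3, "INSIDE", 2),   # the fridge is in the kitchen
        (4, "INSIDE", 3),   # the milk is inside the fridge
    ],
}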


Fig. 3. Two-branch feature encoding: An activity description and partial program (center column) are encoded into a language feature (left) through a pre-trained language model, and into a graph feature (right) through action embedding and node feature extraction.
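The PyTorch sketch below conveys the shape of such a two-branch encoder. The tiny modules and dimensions are stand-ins chosen for illustration (a GRU in place of the pre-trained language model, a linear projection in place of the node feature extraction), not the paper's architecture.

import torch
import torch.nn as nn

class TwoBranchEncoder(nn.Module):
    def __init__(self, vocab_size=1000, node_feat_dim=32, hidden=64):
        super().__init__()
        # Language branch: stands in for the pre-trained language model
        # that encodes the activity description and partial program text.
        self.token_emb = nn.Embedding(vocab_size, hidden)
        self.lang_enc = nn.GRU(hidden, hidden, batch_first=True)
        # Graph branch: action embedding plus a projection of per-node
        # scene features stands in for the node feature extraction.
        self.action_emb = nn.Embedding(vocab_size, hidden)
        self.node_proj = nn.Linear(node_feat_dim, hidden)

    def forward(self, text_tokens, action_ids, node_feats):
        # text_tokens: (B, T) token ids of description + partial program
        # action_ids:  (B, S) ids of the actions executed so far
        # node_feats:  (B, N, node_feat_dim) per-node scene features
        _, h = self.lang_enc(self.token_emb(text_tokens))
        lang_feat = h[-1]                                     # (B, hidden)
        graph_feat = self.node_proj(node_feats).mean(dim=1)   # (B, hidden)
        graph_feat = graph_feat + self.action_emb(action_ids).mean(dim=1)
        return lang_feat, graph_feat

# Toy usage with random inputs:
enc = TwoBranchEncoder()
lang_feat, graph_feat = enc(
    torch.randint(0, 1000, (2, 12)),   # tokenized description + program
    torch.randint(0, 1000, (2, 3)),    # action history ids
    torch.randn(2, 8, 32),             # 8 scene nodes, 32-dim features
)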


Fig. 4. Instruction generation: a graph-guided probability, a language-guided probability, and a human-centric probability are fused to guide the selection of the object and action that compose a program instruction.
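One simple way to fuse three such distributions is a product in log space, as sketched below; the paper's actual fusion mechanism (e.g., learned weights or gating) may differ, so treat this purely as an assumed baseline.

import torch

def fuse_and_select(p_graph, p_lang, p_human, eps=1e-8):
    """Each input: (num_candidates,) probabilities over the same candidates."""
    # Combine the three distributions multiplicatively in log space for
    # numerical stability, then renormalize with a softmax.
    log_p = (p_graph + eps).log() + (p_lang + eps).log() + (p_human + eps).log()
    fused = torch.softmax(log_p, dim=-1)
    return fused.argmax().item(), fused

# Toy example over three candidate objects ("fridge", "milk", "tv"):
p_graph = torch.tensor([0.5, 0.3, 0.2])   # what the scene graph supports
p_lang  = torch.tensor([0.2, 0.7, 0.1])   # what the description suggests
p_human = torch.tensor([0.4, 0.4, 0.2])   # what is near the agent
best_idx, fused = fuse_and_select(p_graph, p_lang, p_human)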


Fig. 6. Instruction execution and scene update: given the current scene graph, memory, and instruction, the instruction is executed to modify the graph, and the node features are then refreshed, first by the memory-based feature update and then by the activity-aware feature propagation.
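A toy sketch of this execution-and-update step is given below. The dictionary schema and the hand-written "open" rule are illustrative assumptions; the two learned stages (memory-based feature update, activity-aware feature propagation) are only stubbed out.

def execute(scene_graph, action, target_id):
    """Apply a hand-written state edit for one instruction type."""
    states = scene_graph["nodes"][target_id]["states"]
    if action == "open" and "CLOSED" in states:
        states.remove("CLOSED")
        states.append("OPEN")
    return scene_graph

def update_node_features(node_feats, memory, scene_graph):
    # Stub for the two learned stages: the memory-based feature update,
    # followed by activity-aware feature propagation over the graph.
    return node_feats, memory

# Toy usage: opening the fridge flips its CLOSED state to OPEN.
scene = {"nodes": {3: {"class": "fridge", "states": ["CLOSED"]}}}
execute(scene, "open", 3)
assert "OPEN" in scene["nodes"][3]["states"]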

Activity Demonstrations

Acknowledgements

We thank the anonymous reviewers for their valuable comments. This work was supported in part by NSFC (62322207, U21B2023, U2001206), Guangdong Natural Science Foundation (2021B1515020085), Shenzhen Science and Technology Program (RCYX20210609103121030), Tencent AI Lab Rhino-Bird Focused Research Program (RBFR2022013), NSERC Canada through a Discovery Grant, and the Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ).

BibTeX


@article{Su23LangGuidedProg,
  title   = {Scene-aware Activity Program Generation with Language Guidance},
  author  = {Zejia Su and Qingnan Fan and Xuelin Chen and Oliver van Kaick and Hui Huang and Ruizhen Hu},
  journal = {ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia)},
  volume  = {42},
  number  = {6},
  pages   = {},
  year    = {2023},
}