Scene-aware Activity Program Generation with Language Guidance

¹Shenzhen University   ²Vivo   ³Tencent AI Lab   ⁴Carleton University
SIGGRAPH Asia 2023
*Indicates Corresponding Author

Fig. 1. Our method generates scene-aware activity programs that are highly rational and executable. Here, we show three programs generated by our method, where a virtual avatar is instructed to perform various human activities in different scenes according to the input descriptions.

Abstract

We address the problem of scene-aware activity program generation, which requires decomposing a given activity task into instructions that can be sequentially performed within a target scene to complete the activity. While existing methods can generate programs that are either rational or executable, generating programs with both high rationality and executability remains a challenge. Hence, we propose a novel method whose key idea is to explicitly combine the language rationality of a powerful language model with dynamic perception of the target scene in which instructions are executed, to generate programs with both high rationality and executability. Our method generates the instructions of an activity program iteratively. Specifically, a two-branch feature encoder operates on language-based and graph-based representations of the current generation progress to extract language features and scene graph features, respectively. These features are then used by a predictor to generate the next instruction in the program. Subsequently, another module executes the predicted instruction and updates the scene for perception in the next iteration. Extensive evaluations on the VirtualHome-Env dataset show the advantages of our method over previous work. Key algorithmic designs are validated through ablation studies, and results on other types of inputs further demonstrate the generalizability of our method.
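For illustration, the following minimal Python sketch mirrors the iterative loop described above. Every name in it (Instruction, encode_language, encode_graph, predict_instruction, execute_and_update) is a hypothetical stub standing in for the learned modules, not the actual implementation.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Instruction:
    action: str               # e.g. "walk", "grab", "switchon"
    target: Optional[str]     # e.g. "fridge"; None for the stop token

def encode_language(description, program):
    # Stub: a pre-trained language model encodes the activity description
    # together with the text of the partial program.
    return (description, tuple(i.action for i in program))

def encode_graph(scene_graph, program):
    # Stub: a graph encoder extracts per-node features from the current
    # scene state.
    return tuple(sorted(scene_graph))

def predict_instruction(lang_feat, graph_feat):
    # Stub: the real predictor fuses both features to score candidate
    # actions and objects; this placeholder stops immediately.
    return Instruction("STOP", None)

def execute_and_update(scene_graph, instruction):
    # Stub: executing an instruction modifies the scene topology and
    # object properties seen by the next iteration's perception.
    return scene_graph

def generate_program(description, scene_graph, max_steps=30):
    """Iteratively grow the program until a stop instruction is predicted."""
    program = []
    for _ in range(max_steps):
        lang_feat = encode_language(description, program)    # language branch
        graph_feat = encode_graph(scene_graph, program)      # graph branch
        instruction = predict_instruction(lang_feat, graph_feat)
        if instruction.action == "STOP":
            break
        program.append(instruction)
        scene_graph = execute_and_update(scene_graph, instruction)
    return program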


Fig. 2. Overview of our activity program generation method, which takes as input (a) an activity description and a scene with a human agent, indicated by the yellow H node. (b) At each iteration, the two-branch feature encoding module first encodes the current state information, including the scene state, partial program, and description of the desired activity, to guide the instruction generation module in predicting the next action and its target object. The instruction execution and scene update module then updates the scene topology and properties based on the execution of the predicted instruction. The method iterates until (c) the full program and its corresponding motion sequence are generated.
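The scene in (a) is represented as a graph over objects and the agent. The snippet below illustrates one plausible schema, loosely following the VirtualHome-style node/edge representation; the exact fields are assumptions for illustration only.

# The agent appears as an ordinary node (the yellow "H" node in Fig. 2).
scene_graph = {
    "nodes": {
        1: {"class": "character", "states": []},             # the human agent
        2: {"class": "kitchen",   "states": []},
        3: {"class": "fridge",    "states": ["CLOSED", "OFF"]},
        4: {"class": "milk",      "states": ["GRABBABLE"]},
    },
    "edges": [
        (1, "INSIDE", 2),   # the agent is in the kitchen
        (3, "INSIDE", 2),   # the fridge is in the kitchen
        (4, "INSIDE", 3),   # the milk is inside the fridge
    ],
}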


Fig. 3. Two-branch feature encoding: An activity description and partial program (center column) are encoded into a language feature (left) through a pre-trained language model, and into a graph feature (right) through action embedding and node feature extraction.
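The PyTorch sketch below conveys the shape of such a two-branch encoder. The tiny modules and dimensions are stand-ins chosen for illustration (a GRU in place of the pre-trained language model, a linear projection in place of the node feature extraction), not the paper's architecture.

import torch
import torch.nn as nn

class TwoBranchEncoder(nn.Module):
    def __init__(self, vocab_size=1000, node_feat_dim=32, hidden=64):
        super().__init__()
        # Language branch: stands in for the pre-trained language model
        # that encodes the activity description and partial program text.
        self.token_emb = nn.Embedding(vocab_size, hidden)
        self.lang_enc = nn.GRU(hidden, hidden, batch_first=True)
        # Graph branch: action embedding plus a projection of per-node
        # scene features stands in for the node feature extraction.
        self.action_emb = nn.Embedding(vocab_size, hidden)
        self.node_proj = nn.Linear(node_feat_dim, hidden)

    def forward(self, text_tokens, action_ids, node_feats):
        # text_tokens: (B, T) token ids of description + partial program
        # action_ids:  (B, S) ids of the actions executed so far
        # node_feats:  (B, N, node_feat_dim) per-node scene features
        _, h = self.lang_enc(self.token_emb(text_tokens))
        lang_feat = h[-1]                                     # (B, hidden)
        graph_feat = self.node_proj(node_feats).mean(dim=1)   # (B, hidden)
        graph_feat = graph_feat + self.action_emb(action_ids).mean(dim=1)
        return lang_feat, graph_feat

# Toy usage with random inputs:
enc = TwoBranchEncoder()
lang_feat, graph_feat = enc(
    torch.randint(0, 1000, (2, 12)),   # tokenized description + program
    torch.randint(0, 1000, (2, 3)),    # action history ids
    torch.randn(2, 8, 32),             # 8 scene nodes, 32-dim features
)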


Fig. 4. Instruction generation: a graph-guided probability, a language-guided probability, and a human-centric probability are fused to guide the selection of the object and action that compose a program instruction.
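One simple way to fuse three such distributions is a product in log space, as sketched below; the paper's actual fusion mechanism (e.g., learned weights or gating) may differ, so treat this purely as an assumed baseline.

import torch

def fuse_and_select(p_graph, p_lang, p_human, eps=1e-8):
    """Each input: (num_candidates,) probabilities over the same candidates."""
    # Combine the three distributions multiplicatively in log space for
    # numerical stability, then renormalize with a softmax.
    log_p = (p_graph + eps).log() + (p_lang + eps).log() + (p_human + eps).log()
    fused = torch.softmax(log_p, dim=-1)
    return fused.argmax().item(), fused

# Toy example over three candidate objects ("fridge", "milk", "tv"):
p_graph = torch.tensor([0.5, 0.3, 0.2])   # what the scene graph supports
p_lang  = torch.tensor([0.2, 0.7, 0.1])   # what the description suggests
p_human = torch.tensor([0.4, 0.4, 0.2])   # what is near the agent
best_idx, fused = fuse_and_select(p_graph, p_lang, p_human)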


Fig. 6. Instruction execution and scene update: given the current scene graph, memory, and instruction, the instruction is executed to modify the graph, and the node features are then refreshed, first by the memory-based feature update and then by the activity-aware feature propagation.
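A toy sketch of this execution-and-update step is given below. The dictionary schema and the hand-written "open" rule are illustrative assumptions; the two learned stages (memory-based feature update, activity-aware feature propagation) are only stubbed out.

def execute(scene_graph, action, target_id):
    """Apply a hand-written state edit for one instruction type."""
    states = scene_graph["nodes"][target_id]["states"]
    if action == "open" and "CLOSED" in states:
        states.remove("CLOSED")
        states.append("OPEN")
    return scene_graph

def update_node_features(node_feats, memory, scene_graph):
    # Stub for the two learned stages: the memory-based feature update,
    # followed by activity-aware feature propagation over the graph.
    return node_feats, memory

# Toy usage: opening the fridge flips its CLOSED state to OPEN.
scene = {"nodes": {3: {"class": "fridge", "states": ["CLOSED"]}}}
execute(scene, "open", 3)
assert "OPEN" in scene["nodes"][3]["states"]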

Activity Demonstrations

Acknowledgements

We thank the anonymous reviewers for their valuable comments. This work was supported in part by NSFC (62322207, U21B2023, U2001206), Guangdong Natural Science Foundation (2021B1515020085), Shenzhen Science and Technology Program (RCYX20210609103121030), Tencent AI Lab Rhino-Bird Focused Research Program (RBFR2022013), NSERC Canada through a Discovery Grant, and the Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ).

BibTeX


@article{Su23LangGuidedProg,
  title   = {Scene-aware Activity Program Generation with Language Guidance},
  author  = {Zejia Su and Qingnan Fan and Xuelin Chen and Oliver van Kaick and Hui Huang and Ruizhen Hu},
  journal = {ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia)},
  volume  = {42},
  number  = {6},
  pages   = {},
  year    = {2023},
}