General Image-to-Image Translation with One-Shot Image Guidance

Large-scale text-to-image models pre-trained on massive text-image pairs have recently shown excellent performance in image synthesis. However, an image can provide more intuitive visual concepts than plain text. People may ask: how can we integrate a desired visual concept into an existing image, such as our portrait? Current methods are inadequate in meeting this demand because they lack the ability to preserve content or translate visual concepts effectively. Motivated by this, we propose a novel framework named visual concept translator (VCT), which preserves the content of the source image and translates the visual concepts guided by a single reference image. The proposed VCT contains a content-concept inversion (CCI) process to extract contents and concepts, and a content-concept fusion (CCF) process to gather the extracted information to obtain the target image. Given only one reference image, the proposed VCT can complete a wide range of general image-to-image translation tasks with excellent results. Extensive experiments are conducted to demonstrate the superiority and effectiveness of the proposed method.

Authors: Bin Cheng*, Zuhao Liu*, Yunbo Peng, Yue Lin

Learning Analytical Posterior Probability for Human Mesh Recovery

Despite various probabilistic methods for modeling the uncertainty and ambiguity in human mesh recovery, their overall precision is limited because existing formulations for joint rotations are either not constrained to SO(3) or difficult to learn for neural networks. To address this issue, we derive a novel analytical formulation for learning posterior probability distributions of human joint rotations conditioned on bone directions in a Bayesian manner, and based on this, we propose a new posterior-guided framework for human mesh recovery. We demonstrate that our framework is not only superior to existing SOTA baselines on multiple benchmarks but also flexible enough to seamlessly incorporate additional sensors due to its Bayesian nature.

Authors: Qi Fang, Kang Chen, Yinghui Fan, Qing Shuai, Jiefeng Li, Weidong Zhang

MIGA: A Unified Multi-task Generation Framework for Conversational Text-to-SQL

Conversational text-to-SQL is designed to translate multi-turn natural language questions into their corresponding SQL queries. Most state-of-the-art conversational text-to-SQL methods are incompatible with generative pre-trained language models (PLMs), such as T5. In this paper, we present a two-stage unified MultI-task Generation frAmework (MIGA) that leverages PLMs' ability to tackle conversational text-to-SQL. In the pre-training stage, MIGA first decomposes the main task into several related sub-tasks and then unifies them into the same sequence-to-sequence (Seq2Seq) paradigm with task-specific natural language prompts to boost the main task from multi-task training. Later in the fine-tuning stage, we propose four SQL perturbations to alleviate the error propagation problem. MIGA tends to achieve state-of-the-art performance on two benchmarks (SparC and CoSQL). We also provide extensive analyses and discussions to shed light on some new perspectives for conversational text-to-SQL.

Authors: Yingwen Fu, Wenjie Ou, Zhou Yu, Yue Lin

CLIPVG: Text-Guided Image Manipulation Using Differentiable Vector Graphics

Considerable progress has recently been made in leveraging CLIP (Contrastive Language-Image Pre-Training) models for text-guided image manipulation. However, all existing works rely on additional generative models to ensure the quality of results, because CLIP alone cannot provide enough guidance information for fine-scale pixel-level changes. In this paper, we introduce CLIPVG, a text-guided image manipulation framework using differentiable vector graphics, which is also the first CLIP-based general image manipulation framework that does not require any additional generative models. We demonstrate that CLIPVG not only achieves state-of-the-art performance in both semantic correctness and synthesis quality, but is also flexible enough to support various applications far beyond the capability of all existing methods.

Authors: Yiren Song, Xuning Shao, Kang Chen, Weidong Zhang, Minzhe Li, Zhongliang Jing

GestureMaster: Graph-based Speech-driven Gesture Generation

This paper describes the GestureMaster entry to the GENEA (Generation and Evaluation of Non-verbal Behaviour for Embodied Agents) Challenge 2022. Given speech audio and text transcriptions, GestureMaster can automatically generate a high-quality gesture sequence that matches the input audio and text transcriptions in terms of style and rhythm. The GestureMaster system is based on the recent ChoreoMaster publication. ChoreoMaster can generate dance motion given a piece of music. We make some adjustments to ChoreoMaster to suit the speech-driven gesture generation task. We are pleased to see that among the participating systems, our entry attained the highest median score in the human-likeness evaluation. In the appropriateness evaluation, we ranked first in the upper-body study and second in the full-body study.

Authors: Yinglin Duan, Tengyue Bian, Kang Chen

Lamarckian Platform: Pushing the Boundaries of

Despite the emerging progress of integrating evolutionary computation into reinforcement learning, the absence of a high-performance platform endowing composability and massive parallelism causes non-trivial difficulties for research and applications related to asynchronous commercial games. Here we introduce Lamarckian, an open-source platform featuring support for evolutionary reinforcement learning scalable to distributed computing resources. To improve the training speed and data efficiency, Lamarckian adopts optimized communication methods and an asynchronous evolutionary reinforcement learning workflow. To meet the demand for an asynchronous interface by commercial games and various methods, Lamarckian tailors an asynchronous Markov Decision Process interface and designs an object-oriented software architecture with decoupled modules. In comparison with the state-of-the-art RLlib, we empirically demonstrate the unique advantages of Lamarckian on benchmark tests with up to 6000 CPU cores: i) both the sampling efficiency and training speed are doubled when running PPO on the Google football game; ii) the training speed is 13 times faster when running PBT+PPO on the Pong game. Moreover, we also present two use cases: i) how Lamarckian is applied to generating behavior-diverse game AI; ii) how Lamarckian is applied to game balancing tests for an asynchronous commercial game.

Authors: Hui Bai, Ruimin Shen, Yue Lin, Botian Xu, Ran Cheng

Authors: Kewei Yang, Kang Chen, Daoliang Guo, Song-Hai Zhang, Yuan-Chen Guo, and Weidong Zhang

SPatchGAN: A Statistical Feature Based Discriminator for Unsupervised Image-to-Image Translation

For unsupervised image-to-image translation, we propose a discriminator architecture which focuses on the statistical features instead of individual patches. The network is stabilized by distribution matching of key statistical features at multiple scales. Unlike the existing methods which impose more and more constraints on the generator, our method facilitates the shape deformation and enhances the fine details with a greatly simplified framework. We show that the proposed method outperforms the existing state-of-the-art models in various challenging applications including selfie-to-anime, male-to-female and glasses removal.

Authors: Xuning Shao, Weidong Zhang

MoCap-Solver: A Neural Solver for Optical Motion Capture Data

In a conventional optical motion capture (MoCap) workflow, two processes are needed to turn captured raw marker sequences into correct skeletal animation sequences. Firstly, various tracking errors present in the markers must be fixed (cleaning or refining). Secondly, an agent skeletal mesh must be prepared for the actor/actress, and used to determine skeleton information from the markers (re-targeting or solving). The whole process, normally referred to as solving MoCap data, is extremely time-consuming, labor-intensive, and usually the most costly part of animation production. Hence, there is a great demand for automated tools in industry. In this work, we present MoCap-Solver, a production-ready neural solver for optical MoCap data. It can directly produce skeleton sequences and clean marker sequences from raw MoCap markers, without any tedious manual operations. To achieve this goal, our key idea is to make use of neural encoders concerning three key intrinsic components: the template skeleton, marker configuration and motion, and to learn to predict these latent vectors from imperfect marker sequences containing noise and errors. By decoding these components from latent vectors, sequences of clean markers and skeletons can be directly recovered. Moreover, we also provide a novel normalization strategy based on learning a pose-dependent marker reliability function, which greatly improves system robustness. Experimental results demonstrate that our algorithm consistently outperforms the state-of-the-art on both synthetic and real-world datasets.

Authors: Kang Chen, Yupan Wang, Song-Hai Zhang, Sen-Zhe Xu, Weidong Zhang, Shi-Min Hu

ChoreoMaster: Choreography-Oriented Music-Driven Dance Synthesis

Despite strong demand in the game and film industry, automatically synthesizing high-quality dance motions remains a challenging task. In this paper, we present ChoreoMaster, a production-ready music-driven dance motion synthesis system. Given a piece of music, ChoreoMaster can automatically generate a high-quality dance motion sequence to accompany the input music in terms of style, rhythm and structure. To achieve this goal, we introduce a novel choreography-oriented choreomusical embedding framework, which successfully constructs a unified choreomusical embedding space for both style and rhythm relationships between music and dance phrases. The learned choreomusical embedding is then incorporated into a novel choreography-oriented graph-based motion synthesis framework, which can robustly and efficiently generate high-quality dance motions following various choreographic rules. Moreover, as a production-ready system, ChoreoMaster is sufficiently controllable and comprehensive for users to produce desired results. Experimental results demonstrate that dance motions generated by ChoreoMaster are accepted by professional artists.

Authors: Kang Chen, Zhipeng Tan, Jin Lei, Songhai Zhang, Yuanchen Guo, Weidong Zhang, Shimin Hu

SARG: A Novel Semi Autoregressive Generator for Multi-turn Incomplete Utterance Restoration

Open-domain dialogue systems have achieved great success due to the easily obtained single-turn corpus and the development of deep learning, but the multi-turn scenario is still a challenge because of frequent coreference and information omission. In this paper, we investigate incomplete utterance restoration, which has brought general improvement over multi-turn dialogue systems in recent studies. Jointly inspired by autoregression for text generation and sequence labeling for text editing, we propose a novel semi autoregressive generator (SARG) with high efficiency and flexibility. Moreover, experiments on two benchmarks show that our proposed model significantly outperforms the state-of-the-art models in terms of quality and inference speed.

Authors: Mengzuo Huang, Feng Li, Wuhe Zou, Weidong Zhang