Pose-guided person image generation typically relies on paired source-target images to supervise training, which significantly increases the data preparation effort and limits the applicability of such models. To address this problem, we propose a novel multi-level statistics transfer model, which disentangles and transfers multi-level appearance features from person images and merges them with pose features to reconstruct the source person images themselves, so that the source images can serve as supervision for self-driven person image generation. Specifically, our model extracts multi-level features from the appearance encoder and learns the optimal appearance representation through an attention mechanism and attribute statistics. These representations are then transferred to a pose-guided generator, which re-fuses appearance and pose. Our approach allows flexible manipulation of person appearance and pose attributes to perform pose transfer and clothes style transfer tasks. Experimental results on the DeepFashion dataset demonstrate the superiority of our method over state-of-the-art supervised and unsupervised methods. In addition, our approach also performs well in the wild.
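To make the statistics-transfer idea concrete, the sketch below computes channel-wise mean and standard deviation of appearance features restricted to one semantic part, and injects them into pose-guided features in an AdaIN-like fashion. This is a minimal illustration under our own assumptions: the function names, tensor shapes, and the choice of mean/std as the "attribute statistics" are illustrative, not the paper's exact implementation.

import torch
import torch.nn.functional as F

def part_statistics(feat, part_mask, eps=1e-5):
    """Channel-wise mean/std of `feat` restricted to one semantic part.

    feat:      (B, C, H, W) appearance feature map at one level
    part_mask: (B, 1, h, w) binary mask for one part (e.g. clothes)
    Returns (mu, sigma), each of shape (B, C, 1, 1).
    """
    # Resize the mask to this level's feature resolution.
    mask = F.interpolate(part_mask.float(), size=feat.shape[-2:], mode="nearest")
    area = mask.sum(dim=(2, 3), keepdim=True).clamp(min=1.0)
    mu = (feat * mask).sum(dim=(2, 3), keepdim=True) / area
    var = ((feat - mu) ** 2 * mask).sum(dim=(2, 3), keepdim=True) / area
    return mu, (var + eps).sqrt()

def transfer_statistics(pose_feat, mu, sigma, eps=1e-5):
    """Inject appearance statistics into pose-guided generator features
    (AdaIN-style: normalize, then re-scale/shift with the source stats)."""
    p_mu = pose_feat.mean(dim=(2, 3), keepdim=True)
    p_sigma = pose_feat.std(dim=(2, 3), keepdim=True) + eps
    return sigma * (pose_feat - p_mu) / p_sigma + mu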
Our model can be trained in a self-driven way without paired source-target images, and it flexibly controls the appearance and pose attributes at inference to achieve pose transfer and clothes style transfer. The images in (c) show results generated by this model for simultaneous pose and clothes style transfer: source A is transferred to the target pose, and its clothes are replaced with source B's.
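A hedged sketch of what such inference-time control might look like, assuming the model exposes per-part appearance statistics and a pose-conditioned generator; all module names and part keys here (extract_stats, generate, "upper_clothes") are hypothetical, not the paper's API:

# Hypothetical inference: pose from the target, clothes statistics from
# source B, all remaining appearance from source A.
stats_a = model.extract_stats(img_a, seg_a)  # dict: part name -> per-level (mu, sigma)
stats_b = model.extract_stats(img_b, seg_b)
stats_a["upper_clothes"] = stats_b["upper_clothes"]      # clothes style transfer
out = model.generate(pose_target, pose_con_target, stats_a)  # pose transfer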
The appearance encoder extracts features of the person image parts Ia_parts according to the semantic segmentation map Sa. The pose encoder encodes the pose image Pa and the pose connection map Pa_con and guides the Generator to synthesize the source posture. The MUST module disentangles and transfers multi-level appearance features, and the Generator fuses these appearance features with the pose codes to reconstruct the source person image Ia.
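Restating the caption's data flow as code, a minimal training-step sketch under the assumption of a plain L1 reconstruction loss; the module classes are placeholders for the paper's components, and the full objective likely also includes adversarial and perceptual terms:

import torch.nn as nn

class SelfDrivenModel(nn.Module):
    """Illustrative wiring of the caption's data flow; layer details are assumed."""
    def __init__(self, app_enc, pose_enc, must, generator):
        super().__init__()
        self.app_enc, self.pose_enc = app_enc, pose_enc
        self.must, self.generator = must, generator

    def forward(self, img_a, seg_a, pose_a, pose_con_a):
        # Appearance encoder: multi-level features of the person parts in Ia,
        # separated by the semantic segmentation map Sa.
        app_feats = self.app_enc(img_a, seg_a)        # list over feature levels
        # Pose encoder: codes from the pose image Pa and connection map Pa_con.
        pose_code = self.pose_enc(pose_a, pose_con_a)
        # MUST: disentangle and transfer multi-level appearance statistics.
        app_stats = self.must(app_feats)
        # Generator: fuse appearance statistics with pose codes.
        return self.generator(pose_code, app_stats)

# Self-driven training: the source image supervises its own reconstruction.
# recon = model(img_a, seg_a, pose_a, pose_con_a)
# loss = nn.functional.l1_loss(recon, img_a)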