Figure: Face video editing. Our method improves over the baseline (STIT) in temporal consistency (left, “eyeglasses”) and in robustness to unusual cases such as a hand-occluded face (right, “beard”).
Figure: Qualitative editing comparison across methods (Original, Latent Transformer, STIT, VideoEditGAN, Ours).
Figure: Additional editing comparisons on five videos (Original, STIT, Ours).
Figure: Editing-speed comparison (Original, STIT, Ours), in seconds per frame:
- STIT: 12.0 sec/frame
- Ours (T=1000): 62.4 sec/frame
- Ours (T=100): 7.3 sec/frame
- Ours (+Sampler): 2.9 sec/frame
If you find our work useful, please cite our paper:
@InProceedings{Kim_2023_CVPR,
    author    = {Kim, Gyeongman and Shim, Hajin and Kim, Hyunsu and Choi, Yunjey and Kim, Junho and Yang, Eunho},
    title     = {Diffusion Video Autoencoders: Toward Temporally Consistent Face Video Editing via Disentangled Video Encoding},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {6091-6100}
}