The teacher and the student
Our approach is built on knowledge distillation, a “teacher–student” training method. We start with a “teacher”: a large, powerful, pre-trained generative model that is an expert at creating the desired visual effect but is far too slow for real-time use. The choice of teacher model depends on the goal. Initially, we used a StyleGAN2 model custom-trained on our curated dataset for real-time facial effects. It could be paired with tools like StyleCLIP to manipulate facial features based on text descriptions, which provided a strong foundation. As the project advanced, we transitioned to more sophisticated generative models like Google DeepMind’s Imagen. This shift significantly enhanced our capabilities, enabling higher-fidelity and more diverse imagery, greater artistic control, and a broader range of styles for our on-device generative AI effects.
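To make the idea concrete, here is a minimal sketch of a distillation training step, assuming PyTorch; `distillation_step`, the callable `teacher`, and the plain L1 loss are illustrative assumptions rather than our production pipeline, with the teacher idealized as an image-to-image function that applies the target effect.

```python
# Minimal distillation sketch (illustrative only): a frozen teacher renders the
# target effect, and the lightweight student learns to reproduce it directly.
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, optimizer, face_batch):
    """One training step: the student learns to mimic the teacher's output."""
    teacher.eval()
    with torch.no_grad():
        target = teacher(face_batch)      # slow, high-quality rendering of the effect
    prediction = student(face_batch)      # fast, on-device-sized model
    # Pixel reconstruction loss; in practice this is often combined with
    # perceptual and adversarial terms.
    loss = F.l1_loss(prediction, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the teacher only runs during training, its cost never reaches the user’s device; the student alone is shipped.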
The “student” is the model that ultimately runs on the user’s device. It needs to be small, fast, and efficient. We designed a student model with a UNet-based architecture, which is well suited to image-to-image tasks. It uses a MobileNet backbone as its encoder, a design known for its efficiency on mobile devices, paired with a decoder built from MobileNet blocks.
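The sketch below shows one way such a student could be assembled, assuming PyTorch and torchvision’s MobileNetV2 as the encoder backbone; the skip-connection stages, channel widths, and the `DepthwiseSeparableBlock` decoder are illustrative choices, not the production architecture.

```python
# Illustrative MobileNet-backed UNet student (not the production model).
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2


class DepthwiseSeparableBlock(nn.Module):
    """MobileNet-style block: depthwise conv followed by a pointwise projection."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.ReLU6(inplace=True),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU6(inplace=True),
        )

    def forward(self, x):
        return self.block(x)


class MobileUNetStudent(nn.Module):
    """UNet-shaped image-to-image model with a MobileNetV2 encoder."""
    # MobileNetV2 feature stages used as skip connections
    # (strides 1/2, 1/4, 1/8, 1/16, 1/32) and their channel counts.
    SKIP_STAGES = (1, 3, 6, 13, 18)
    SKIP_CHANNELS = (16, 24, 32, 96, 1280)

    def __init__(self, out_channels=3):
        super().__init__()
        self.encoder = mobilenet_v2().features  # ImageNet weights optional
        c = self.SKIP_CHANNELS
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        # Each decoder block fuses the upsampled deep feature with an encoder skip.
        self.decode4 = DepthwiseSeparableBlock(c[4] + c[3], 96)
        self.decode3 = DepthwiseSeparableBlock(96 + c[2], 48)
        self.decode2 = DepthwiseSeparableBlock(48 + c[1], 32)
        self.decode1 = DepthwiseSeparableBlock(32 + c[0], 24)
        self.head = nn.Conv2d(24, out_channels, 3, padding=1)

    def forward(self, x):
        skips, feat = [], x
        for i, stage in enumerate(self.encoder):
            feat = stage(feat)
            if i in self.SKIP_STAGES:
                skips.append(feat)
        s1, s2, s3, s4, bottleneck = skips
        d = self.decode4(torch.cat([self.up(bottleneck), s4], dim=1))
        d = self.decode3(torch.cat([self.up(d), s3], dim=1))
        d = self.decode2(torch.cat([self.up(d), s2], dim=1))
        d = self.decode1(torch.cat([self.up(d), s1], dim=1))
        return torch.tanh(self.head(self.up(d)))  # back to input resolution
```

In a layout like this, the UNet skip connections carry fine spatial detail from the encoder into the decoder, while depthwise separable blocks keep the decoder’s parameter count and compute low enough for real-time mobile inference.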