We tackle the challenge of generating realistic mirror reflections using diffusion-based generative models, formulated as an image inpainting task to enable user control over mirror placement. To support this, we introduce SynMirror, a dataset with $198K$ samples rendered from $66K$ 3D objects, including depth maps, normal maps, and segmentation masks to capture scene geometry.
We propose MirrorFusion, a novel depth-conditioned inpainting method that produces high-quality, photo-realistic reflections, given an input image and mirror mask. MirrorFusion outperforms state-of-the-art methods on SynMirror, offering new possibilities for image editing and augmented reality.
Generating realistic and controllable mirror reflections remains challenging for recent state-of-the-art diffusion-based generative models. To illustrate this limitation, we prompt Stable Diffusion-2.1 to generate a scene containing a mirror reflection.
From the above figure, it is clear that T2I methods fail to generate realistic and plausible mirror reflections: there is little control over where the mirror is placed and which objects it reflects. Moreover, even when provided with an additional mask depicting the mirror region as input, inpainting methods fail to take the scene context into account while generating a plausible reflection.
MirrorFusion, our diffusion-based inpainting model, generates high-quality, geometrically consistent, and photo-realistic mirror reflections given an input image and a mask depicting the mirror region. Our method produces generations of superior quality compared to previous state-of-the-art diffusion-based inpainting methods.
We find that previous mirror datasets are inadequate for training generative models: they are primarily designed for mirror detection and lack the object diversity required to instill priors about mirror reflections in diffusion models.
To address this, we propose SynMirror, a first-of-its-kind large-scale synthetic dataset on mirror reflections, with diverse mirror types, objects, camera poses, HDRI backgrounds and floor textures.
SynMirror consists of samples rendered from 3D assets drawn from two widely used object datasets: Objaverse and Amazon Berkeley Objects (ABO).
We create a virtual environment in Blender by placing a selected 3D object in front of a mirror, and use BlenderProc to render the object along with its depth map, normal map, and segmentation mask. For each object, we render three random views sampled along a trajectory around it.
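For readers unfamiliar with BlenderProc, the sketch below outlines the kind of rendering loop involved. The asset path, camera trajectory, and output location are placeholders, and the mirror plane, floor textures, and HDRI environment that the actual SynMirror pipeline randomizes are omitted; this is a minimal illustration, not the released rendering script.

```python
import blenderproc as bproc
import numpy as np

bproc.init()

# Load one 3D asset (placeholder path); mirror, floor and HDRI setup omitted.
objs = bproc.loader.load_obj("assets/example_object.obj")

# Request the auxiliary outputs stored alongside each RGB render.
bproc.renderer.enable_depth_output(activate_antialiasing=False)
bproc.renderer.enable_normals_output()
bproc.renderer.enable_segmentation_output(map_by=["instance", "name"])

# Sample three camera poses along a circular trajectory around the object.
poi = bproc.object.compute_poi(objs)
for angle in np.linspace(0, 2 * np.pi, 3, endpoint=False):
    location = poi + np.array([2.0 * np.cos(angle), 2.0 * np.sin(angle), 0.8])
    rotation = bproc.camera.rotation_from_forward_vec(poi - location)
    bproc.camera.add_camera_pose(
        bproc.math.build_transformation_mat(location, rotation)
    )

data = bproc.renderer.render()
bproc.writer.write_hdf5("output/", data)
```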
Dataset | Type | Size | Attributes |
---|---|---|---|
MSD | Real | 4018 | RGB, Masks |
Mirror-NeRF | Real & Synthetic | 9 scenes | RGB, Masks, Multi-View |
DLSU-OMRS | Real | 454 | RGB, Mask |
TROSD | Real | 11060 | RGB, Mask |
PMD | Real | 6461 | RGB, Masks |
RGBD-Mirror | Real | 3049 | RGB, Depth |
Mirror3D | Real | 7011 | RGB, Masks, Depth |
SynMirror (Ours) | Synthetic | 198204 | RGB, Depth, Masks, Normals, Multi-View |
We propose MirrorFusion, a novel depth-conditioned inpainting method that generates high-quality mirror reflections given an input image and a mask depicting the mirror region. The architecture builds upon BrushNet, adding a depth channel that encodes the geometry of the object and its placement in the scene relative to the mirror. MirrorFusion is fine-tuned on SynMirror from the Stable-Diffusion-v1.5 checkpoint. During inference, we provide the masked input image and a binary mask depicting the mirror region; the depth map can be estimated from the input image using any monocular depth estimation method, such as Marigold or Depth-Anything-V2.
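As a rough illustration of the depth conditioning, the sketch below shows how a BrushNet-style conditioning branch could take an extra depth channel alongside the masked-image latents and the mirror mask. All function names, tensor names, and channel counts here are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def build_condition_input(noisy_latents, masked_image_latents, mirror_mask, depth):
    """Assemble the conditioning-branch input (illustrative shapes only).

    noisy_latents:        (B, 4, h, w)  noisy latents of the image being denoised
    masked_image_latents: (B, 4, h, w)  VAE latents of the input with the mirror region masked out
    mirror_mask:          (B, 1, H, W)  binary mask of the mirror region (1 = inpaint)
    depth:                (B, 1, H, W)  monocular depth map, normalized to [-1, 1]
    """
    h, w = noisy_latents.shape[-2:]
    mask_lr = F.interpolate(mirror_mask, size=(h, w), mode="nearest")
    depth_lr = F.interpolate(depth, size=(h, w), mode="bilinear", align_corners=False)
    # Channel-wise concatenation; the conditioning branch's first conv layer is
    # widened accordingly (4 + 4 + 1 + 1 = 10 input channels in this sketch).
    return torch.cat([noisy_latents, masked_image_latents, mask_lr, depth_lr], dim=1)
```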
We compare MirrorFusion with different state-of-the-art inpainting methods on MirrorBench, a held-out subset of SynMirror containing seen and unseen object categories.
We quantitatively compare MirrorFusion with BrushNet-FT on MirrorBench, which consists of $1497$ samples from categories seen during training and $1494$ samples from unseen categories. We benchmark along four aspects: masked region preservation, reflection generation quality, reflection geometry, and text alignment. For each sample, we generate 4 outputs using random seeds and report average scores across MirrorBench, selecting the image with the best SSIM score over the unmasked region as the representative image.
Masked Image Preservation metrics are computed over the unmasked region outside the mirror mask.
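The best-of-four selection can be reproduced with a simple loop; the sketch below uses scikit-image's SSIM and assumes the generated and ground-truth images are already aligned uint8 arrays of the same size (function and variable names are illustrative).

```python
import numpy as np
from skimage.metrics import structural_similarity

def pick_representative(gt, candidates, mirror_mask):
    """Pick the candidate with the best SSIM over the unmasked region.

    gt:          (H, W, 3) uint8 ground-truth image
    candidates:  list of (H, W, 3) uint8 generations (e.g. 4 random seeds)
    mirror_mask: (H, W) boolean mask of the mirror region (True = masked)
    """
    best, best_ssim = None, -1.0
    for img in candidates:
        # Full SSIM map, then averaged over the unmasked pixels only.
        _, ssim_map = structural_similarity(gt, img, channel_axis=-1, full=True)
        score = ssim_map[~mirror_mask].mean()
        if score > best_ssim:
            best, best_ssim = img, score
    return best, best_ssim
```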
Masked region preservation (PSNR, SSIM, LPIPS) and text alignment (CLIP Sim):

Models | PSNR ↑ | SSIM ↑ | LPIPS ↓ | CLIP Sim ↑
---|---|---|---|---
BrushNet-FT | 23.06 | 0.84 | 0.058 | 24.90
MirrorFusion (Ours) | 24.22 | 0.84 | 0.051 | 25.23
Reflection Generation Quality metrics are computed over the segmentation mask covering the mirror reflection of the object and the floor in the ground-truth input image. To measure reflection geometry, we compute the Intersection over Union (IoU) between the segmentation regions of the ground-truth object reflection and the generated object reflection, using SAM to segment the reflection of the object in the mirror (a minimal IoU sketch follows the table below).
Reflection generation quality (PSNR, SSIM, LPIPS) and reflection geometry (IoU):

Models | PSNR ↑ | SSIM ↑ | LPIPS ↓ | IoU ↑
---|---|---|---|---
BrushNet-FT | 19.15 | 0.84 | 0.082 | 0.566
MirrorFusion (Ours) | 20.35 | 0.84 | 0.075 | 0.567
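The reflection-geometry score reduces to an overlap ratio between two binary masks: one for the ground-truth object reflection and one obtained by running SAM on the generated image. A minimal sketch, assuming both masks are already boolean arrays of the same size:

```python
import numpy as np

def mask_iou(gt_reflection_mask, pred_reflection_mask):
    """Intersection over Union between two boolean reflection masks."""
    intersection = np.logical_and(gt_reflection_mask, pred_reflection_mask).sum()
    union = np.logical_or(gt_reflection_mask, pred_reflection_mask).sum()
    return intersection / union if union > 0 else 0.0
```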