Reflecting Reality: Enabling Diffusion Models to Produce Faithful Mirror Reflections

Ankit Dhiman 1,2*, Manan Shah 1*, Rishubh Parihar 1, Yash Bhalgat 3, Lokesh R Boregowda and R Venkatesh Babu 1

* Equal Contribution
1 Vision and AI Lab, IISc Bangalore
2 Samsung R & D Institute India - Bangalore
3 Visual Geometry Group, University of Oxford

We tackle the challenge of generating realistic mirror reflections using diffusion-based generative models, formulated as an image inpainting task to enable user control over mirror placement. To support this, we introduce SynMirror, a dataset with $198K$ samples rendered from $66K$ 3D objects, including depth maps, normal maps, and segmentation masks to capture scene geometry.

We propose MirrorFusion, a novel depth-conditioned inpainting method that produces high-quality, photo-realistic reflections, given an input image and mirror mask. MirrorFusion outperforms state-of-the-art methods on SynMirror, offering new possibilities for image editing and augmented reality.


Introduction

Generating realistic and controllable mirror reflections remains challenging for recent state-of-the-art diffusion-based generative models. To illustrate this limitation, we prompt Stable Diffusion 2.1 to generate scenes with mirror reflections.

Prompt: A perfect plane mirror reflection of swivel chair with curved backrest in front of the mirror

Prompt: A perfect plane mirror reflection of a gold lipstick container in front of the mirror on a table

Prompt: A perfect plane mirror reflection of a black stone with swivels in front of the mirror on a table

The figure above shows that T2I methods fail to generate realistic and plausible mirror reflections, and that they offer no control over where the mirror is placed or which objects it reflects. Moreover, even when provided with an additional mask depicting the mirror region as input, inpainting methods fail to take the scene context into account while generating a plausible reflection.

Note: all the images above were generated by prefixing the mirror text prompt "A perfect plane mirror reflection of " to the input object description.

Our model, MirrorFusion, is a diffusion-based inpainting model that generates high-quality, geometrically consistent, and photo-realistic mirror reflections given an input image and a mask depicting the mirror region. Our method produces higher-quality generations than previous state-of-the-art diffusion-based inpainting methods.

Dataset

We find that previous mirror datasets are inadequate for training generative models: they are primarily designed for mirror detection and lack the object diversity needed for diffusion models to learn priors over mirror reflections.

To address this, we propose SynMirror, a first-of-its-kind large-scale synthetic dataset on mirror reflections, with diverse mirror types, objects, camera poses, HDRI backgrounds and floor textures.

Selected samples from SynMirror. Use the slider to view the RGB rendering alongside the depth map, normal map, and segmentation mask of the selected object.


SynMirror consists of samples rendered from 3D assets of two widely used 3D object datasets - Objaverse and Amazon Berkeley Objects (ABO).

We create a virtual environment in Blender by placing a selected 3D object in front of a mirror. We then leverage BlenderProc to render the 3D object along with its depth map, normal map, and segmentation mask. We render 3 random views per object, sampled along a trajectory around the object.

SynMirror dataset generation pipeline. We render $58,115$ objects sampled from Objaverse and all $7,953$ objects from ABO.
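For illustration, a minimal BlenderProc sketch of such a rendering loop is shown below. The asset paths, camera radius, and view sampling are placeholders, and the mirror-plane and floor-texture setup of the actual pipeline is omitted.

```python
import blenderproc as bproc
import numpy as np

bproc.init()

# Hypothetical asset paths; the real pipeline also places a mirror plane
# and samples floor textures, which this sketch omits.
obj = bproc.loader.load_obj("assets/object.obj")[0]
obj.set_location([0, 0, 0])

# Light the scene with an HDRI background.
bproc.world.set_world_background_hdr_img("assets/background.hdr")

# Enable the auxiliary outputs rendered alongside RGB.
bproc.renderer.enable_depth_output(activate_antialiasing=False)
bproc.renderer.enable_normals_output()
bproc.renderer.enable_segmentation_output(map_by=["instance"])

# Three random views sampled on a circular trajectory around the object.
for angle in np.random.uniform(0.0, 2.0 * np.pi, size=3):
    cam_pos = np.array([2.5 * np.cos(angle), 2.5 * np.sin(angle), 1.2])
    rotation = bproc.camera.rotation_from_forward_vec(obj.get_location() - cam_pos)
    bproc.camera.add_camera_pose(bproc.math.build_transformation_mat(cam_pos, rotation))

data = bproc.renderer.render()
bproc.writer.write_hdf5("output/", data)
```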

Dataset             Type               Size       Attributes
MSD                 Real               4,018      RGB, Masks
Mirror-NeRF         Real & Synthetic   9 scenes   RGB, Masks, Multi-View
DLSU-OMRS           Real               454        RGB, Mask
TROSD               Real               11,060     RGB, Mask
PMD                 Real               6,461      RGB, Masks
RGBD-Mirror         Real               3,049      RGB, Depth
Mirror3D            Real               7,011      RGB, Masks, Depth
SynMirror (Ours)    Synthetic          198,204    RGB, Depth, Masks, Normals, Multi-View

A comparison between SynMirror and other mirror datasets. SynMirror provides more attributes and is more than six times larger than all other existing datasets combined.

Method

We propose MirrorFusion, a novel depth-conditioned inpainting method that generates high-quality mirror reflections given an input image and a mask depicting the mirror region. The architecture builds upon BrushNet, adding an input channel for depth, which captures the geometry of the object and its placement in the scene relative to the mirror. MirrorFusion is fine-tuned on SynMirror from the Stable Diffusion v1.5 checkpoint. During inference, we provide the masked input image and a binary mask depicting the mirror region; the depth map can be estimated from the input image with any monocular depth estimation method, such as Marigold or Depth-Anything-V2.
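As a sketch of how the depth input might be obtained in practice, the snippet below uses the Hugging Face depth-estimation pipeline; the checkpoint id is an assumption, and any monocular estimator (e.g. Marigold) can be substituted.

```python
from PIL import Image
from transformers import pipeline

# Assumed checkpoint id; swap in any monocular depth estimator.
depth_estimator = pipeline(
    task="depth-estimation",
    model="depth-anything/Depth-Anything-V2-Small-hf",
)

image = Image.open("input.png").convert("RGB")
depth = depth_estimator(image)["depth"]  # PIL image at the input resolution
depth.save("depth.png")
```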

Overview of the architecture. We encode the input image $X$ using a pre-trained image encoder from Stable Diffusion to get $Z_m$. Subsequently, we resize the mirror mask $m$ and depth map $d$ to obtain resized mask $X_m$ and depth $X_d$. Then, we concatenate noisy latents $Z_t$, $Z_m$, $X_m$, and $X_d$ which are fed into the Conditioning U-Net $\epsilon^{'}_{\theta}$. Each layer of the Generation U-Net $\epsilon_{\theta}$ is conditioned via zero convolutions with corresponding layers of $\epsilon^{'}_{\theta}$. Additionally, $\epsilon_{\theta}$ is conditioned by text embeddings. The pre-trained decoder then decodes the denoised latent to produce an image with mirror reflections.
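A minimal PyTorch sketch of the input assembly and zero-convolution conditioning described above, assuming SD-1.5 latent shapes (4-channel latents at 8x downsampling); all tensors are random stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, H, W = 1, 512, 512
z_t = torch.randn(B, 4, H // 8, W // 8)  # noisy latents Z_t
z_m = torch.randn(B, 4, H // 8, W // 8)  # latents of the masked input image, Z_m
m = torch.rand(B, 1, H, W).round()       # binary mirror mask m
d = torch.rand(B, 1, H, W)               # depth map d

# Resize mask and depth to the latent resolution to obtain X_m and X_d.
x_m = F.interpolate(m, size=z_t.shape[-2:], mode="nearest")
x_d = F.interpolate(d, size=z_t.shape[-2:], mode="bilinear", align_corners=False)

# Conditioning U-Net input: 4 + 4 + 1 + 1 = 10 channels.
cond_in = torch.cat([z_t, z_m, x_m, x_d], dim=1)
assert cond_in.shape == (B, 10, H // 8, W // 8)

# Zero convolution: a 1x1 conv initialised to zero, so the conditioning
# branch contributes nothing at the start of fine-tuning and is learned
# gradually, in the style of ControlNet/BrushNet conditioning.
def zero_conv(channels: int) -> nn.Conv2d:
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

inject = zero_conv(320)  # e.g. for a 320-channel U-Net feature map
```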

Qualitative Results

We compare MirrorFusion with different state-of-the-art inpainting methods on MirrorBench, a held-out subset of SynMirror containing seen and unseen object categories.

Comparison with different inpainting methods.
We compare our results with zero-shot baselines (denoted by -ZS): SD-Inpainting-ZS, PowerPaint-ZS, and BrushNet-ZS. Additionally, we fine-tune BrushNet on SynMirror and refer to it as BrushNet-FT. The top four rows show results on categories unseen during training, while the bottom two rows show results on seen categories from MirrorBench. Zero-shot methods often fail to generate reflections on the mirror or place them incorrectly. BrushNet-FT, trained on SynMirror, produces plausible reflections but lacks geometric accuracy. In contrast, MirrorFusion preserves object shapes and floor textures more accurately and places the reflections correctly.

Quantitative Results

We quantitatively compare MirrorFusion with BrushNet-FT on MirrorBench, which consists of $1497$ samples from categories seen during training and $1494$ samples from unseen categories. We benchmark four aspects: masked region preservation, reflection generation quality, reflection geometry, and text alignment. For each sample, we generate 4 outputs with random seeds and report average scores across MirrorBench, selecting as the representative image the one with the best SSIM score over the unmasked region.
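A sketch of the representative-image selection, assuming the ground truth, the candidate generations, and the mirror mask are available as arrays; zeroing out the mirror region before scoring is a simplification of "SSIM over the unmasked region".

```python
import numpy as np
from skimage.metrics import structural_similarity

def best_by_unmasked_ssim(gt, candidates, mirror_mask):
    """Pick the candidate whose unmasked region best matches the ground truth.

    gt, candidates[i]: HxWx3 uint8 images; mirror_mask: HxW bool (True = mirror).
    """
    keep = ~mirror_mask[..., None]  # 1 outside the mirror, 0 inside
    gt_bg = gt * keep
    scores = [structural_similarity(gt_bg, c * keep, channel_axis=-1)
              for c in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]
```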

Masked image preservation metrics are computed over the unmasked region, i.e. the area of the image outside the mirror mask.

                     Masked Image Preservation       Text Alignment
Models               PSNR ↑    SSIM ↑    LPIPS ↓     CLIP Sim ↑
BrushNet-FT          23.06     0.84      0.058       24.90
MirrorFusion (Ours)  24.22     0.84      0.051       25.23

Reflection generation quality metrics are computed over the segmentation mask containing the mirror reflection of the object and the floor in the ground-truth input image. To measure reflection geometry, we compute the Intersection over Union (IoU) between the segmentation masks of the ground-truth object reflection and the generated object reflection, using SAM to segment the reflection of the object in the mirror.
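The IoU itself reduces to a ratio of boolean-mask overlaps; a minimal sketch follows, with the SAM-based mask extraction assumed to happen upstream.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two boolean masks of the reflected object."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred, gt).sum()) / float(union)
```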

                     Reflection Generation Quality    Reflection Geometry
Models               PSNR ↑    SSIM ↑    LPIPS ↓      IoU ↑
BrushNet-FT          19.15     0.84      0.082        0.566
MirrorFusion (Ours)  20.35     0.84      0.075        0.567