The sim-to-real transfer confidence score is the most asked-about metric in Lucitra Validate. It answers a simple question: given this synthetic dataset, how likely is a model trained on it to perform well on real sensor data? This post explains how we compute it.
The Problem with Single Metrics
Most dataset quality tools produce a single "quality score" based on aggregate statistics — mean annotation IoU, class balance ratio, or image diversity index. These metrics are useful but insufficient for predicting real-world performance because they ignore the interactions between quality dimensions.
A dataset can have perfect annotation accuracy but terrible coverage: every label is correct, but the scenes only represent a single lighting condition. A dataset can have excellent coverage but systematic physics violations: the scenes are diverse, but objects float or interpenetrate in ways that never occur in reality.
Sim-to-real transfer depends on all of these dimensions simultaneously. Our scoring approach reflects that.
Scoring Architecture
The sim-to-real transfer confidence score is computed as a weighted ensemble of four component scores, each measuring a different aspect of dataset readiness.
1. Domain Alignment Score
This measures how closely the synthetic data's visual statistics match what a real sensor would produce. We analyze:
- Color and luminance distributions compared to reference real-world datasets in the same domain (e.g., warehouse, outdoor, surgical)
- Texture frequency spectra — synthetic data often has unrealistic texture smoothness that creates a domain gap
- Noise characteristics — real sensors produce shot noise, read noise, and motion blur that synthetic data may omit
- Dynamic range — the distribution of pixel intensities across highlight and shadow regions
The domain alignment score uses a learned distance metric trained on paired synthetic-real datasets where we know the actual transfer performance. This lets us predict transfer quality without requiring the user to provide real-world reference data.
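As a concrete (and much simplified) illustration of the distribution comparisons above, here is a sketch that compares luminance histograms between a synthetic set and a real reference set using a 1-D earth mover's distance. All function names and the final score mapping are hypothetical; the production metric is learned, not hand-built:

```python
import numpy as np

def luminance_histogram(img, bins=64):
    """Normalized luminance histogram of an RGB image with values in [0, 1]."""
    lum = img @ np.array([0.2126, 0.7152, 0.0722])  # Rec. 709 luma weights
    hist, _ = np.histogram(lum, bins=bins, range=(0.0, 1.0))
    return hist / hist.sum()

def histogram_emd_1d(p, q):
    """1-D earth mover's distance between two normalized histograms:
    the L1 gap between their CDFs, scaled by the number of bins."""
    cdf_gap = np.cumsum(p) - np.cumsum(q)
    return np.abs(cdf_gap).sum() / len(p)

def domain_alignment_luminance(synthetic_imgs, real_imgs):
    """Toy alignment score in (0, 1]: 1.0 means identical luminance statistics.
    Illustrative stand-in for the learned distance metric described above."""
    p = np.mean([luminance_histogram(im) for im in synthetic_imgs], axis=0)
    q = np.mean([luminance_histogram(im) for im in real_imgs], axis=0)
    return 1.0 / (1.0 + histogram_emd_1d(p, q))
```

A dataset rendered only in dim lighting would score lower against a normally exposed reference set, because its luminance mass shifts toward the shadow bins.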
2. Geometric Consistency Score
This evaluates whether the 3D structure implied by the rendered images is internally consistent and physically plausible:
- Depth map validation — we use Depth Anything v2 to estimate depth independently from the rendered RGB, then compare it against the simulator's ground-truth depth buffer
- Multi-view consistency — if the dataset includes multiple viewpoints of the same scene, are the implied 3D structures consistent?
- Scale coherence — are object sizes physically reasonable given their semantic class?
- Occlusion correctness — do partial occlusions match what the depth ordering predicts?
Geometric inconsistencies are one of the strongest predictors of transfer failure because they teach models spatial relationships that don't exist in reality.
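A minimal version of the depth-map validation step might look like the following. The `depth_agreement` helper is hypothetical: because a monocular depth estimate is ambiguous up to scale, it first aligns the estimate to the simulator's ground-truth depth buffer by the median ratio, then reports the standard delta-accuracy (fraction of pixels within a ratio threshold):

```python
import numpy as np

def depth_agreement(est_depth, gt_depth, delta=1.25):
    """Fraction of pixels where a scale-aligned depth estimate agrees with
    the simulator's ground-truth depth within a ratio threshold.
    Illustrative sketch, not the production validator."""
    est = np.asarray(est_depth, dtype=float).ravel()
    gt = np.asarray(gt_depth, dtype=float).ravel()
    valid = (est > 0) & (gt > 0)           # ignore holes / invalid pixels
    est, gt = est[valid], gt[valid]
    est = est * np.median(gt / est)        # resolve the global scale ambiguity
    ratio = np.maximum(est / gt, gt / est) # symmetric ratio error
    return float(np.mean(ratio < delta))   # standard "delta < 1.25" accuracy
```

A low agreement score flags scenes where the rendered RGB implies a 3D structure that contradicts the depth buffer, exactly the kind of inconsistency described above.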
3. Annotation Fidelity Score
This measures how accurately the provided labels match the rendered scene content:
- Bounding box tightness — how closely do boxes match object silhouettes?
- Segmentation mask boundary alignment — measured at the pixel level against rendered object masks
- Label correctness — we use Qwen2.5-VL to classify objects independently and compare the results against the provided labels
- Keypoint accuracy — for pose estimation datasets, are annotated keypoints at physically plausible locations?
Small annotation errors in synthetic data often go unnoticed because they're consistent (the same procedural bug affects every instance). But they create systematic biases that degrade real-world performance.
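For intuition, the bounding box tightness check can be sketched as the IoU between the annotated box and the tightest box around the rendered object mask. The names and box convention here are illustrative, not the production check:

```python
import numpy as np

def tight_box(mask):
    """Tightest axis-aligned box (x0, y0, x1, y1) around a binary object mask."""
    ys, xs = np.nonzero(mask)
    return (xs.min(), ys.min(), xs.max() + 1, ys.max() + 1)

def box_iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def box_tightness(annotated_box, mask):
    """Tightness in [0, 1]: 1.0 means the annotated box exactly hugs the
    rendered silhouette; padded or shifted boxes score lower."""
    return box_iou(annotated_box, tight_box(mask))
```

Because a procedural annotation bug tends to pad every box by the same margin, the tightness distribution across the dataset is often bimodal rather than noisy, which makes this kind of systematic error easy to surface.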
4. Distribution Readiness Score
This evaluates whether the dataset's statistical distribution prepares a model for the target deployment scenario:
- Class frequency analysis against expected real-world priors
- Pose and viewpoint coverage — are objects seen from a sufficient range of angles?
- Contextual diversity — does the same object appear in varied backgrounds and lighting?
- Edge case representation — are rare but important scenarios (occlusions, unusual poses, failure modes) adequately represented?
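Viewpoint coverage, for example, can be approximated by the normalized entropy of the camera azimuth distribution: evenly spread viewpoints score near 1, while a dataset shot from a single direction scores near 0. This sketch assumes azimuth angles in degrees and is only a stand-in for the full coverage analysis:

```python
import numpy as np

def viewpoint_coverage(azimuths_deg, bins=12):
    """Coverage score in [0, 1] from the normalized entropy of the azimuth
    histogram. Illustrative sketch, not the production estimator."""
    hist, _ = np.histogram(np.mod(azimuths_deg, 360.0),
                           bins=bins, range=(0.0, 360.0))
    p = hist / hist.sum()
    nz = p[p > 0]                      # 0 * log(0) is treated as 0
    entropy = -(nz * np.log(nz)).sum()
    return float(entropy / np.log(bins))  # divide by max possible entropy
```

The same construction extends to elevation, distance, or lighting-condition bins; low entropy in any of them points at a coverage gap.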
Combining the Scores
The four component scores are combined using a learned weighting function rather than fixed weights. The weights depend on:
- Dataset domain — for outdoor autonomous driving, domain alignment matters more; for bin-picking, geometric consistency dominates
- Dataset size — smaller datasets are penalized more heavily for distribution gaps
- Intended model architecture — object detection models are more sensitive to annotation fidelity than scene classification models
The final sim-to-real transfer confidence score is a value between 0 and 1, where:
- 0.90+ indicates high confidence in successful transfer
- 0.70–0.89 indicates likely transfer with some degradation
- 0.50–0.69 indicates significant risk of transfer failure
- Below 0.50 indicates the dataset needs substantial improvement before training
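Putting the pieces together, a simplified version of the combination and banding logic looks like this. The fixed `weights` dictionary stands in for the learned, context-dependent weighting function; only the score bands come from the thresholds above:

```python
def transfer_confidence(components, weights):
    """Weighted blend of the four component scores plus a readability band.
    `weights` is a hypothetical stand-in for the learned weighting function,
    which in practice depends on domain, dataset size, and model architecture."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights must be convex
    score = sum(weights[k] * components[k] for k in weights)
    if score >= 0.90:
        band = "high confidence"
    elif score >= 0.70:
        band = "likely transfer, some degradation"
    elif score >= 0.50:
        band = "significant risk of transfer failure"
    else:
        band = "needs substantial improvement"
    return score, band
```

Note how a single weak component (here, distribution readiness) can pull an otherwise strong dataset below the 0.90 band, which is the interaction effect that single aggregate metrics miss.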
Calibration
We calibrate the score against measured transfer performance: for each dataset in our calibration set, we train a reference model, evaluate it on real data, and compare the predicted score against actual performance. The current calibration achieves a rank correlation of 0.87 with actual transfer performance across our benchmark suite.
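The rank correlation here is the usual construction: correlate the ranks of predicted scores with the ranks of measured transfer performance, so only the ordering matters, not the raw values. A minimal tie-free sketch (in practice a library routine with proper tie handling would be used):

```python
import numpy as np

def spearman_rank_corr(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.
    Assumes no ties, which is enough for this illustration."""
    rx = np.argsort(np.argsort(x)).astype(float)  # rank of each element
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))
```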
The score is not a guarantee. It's a probabilistic estimate based on dataset properties that have historically predicted transfer success. We report confidence intervals alongside the point estimate.
The sim-to-real transfer confidence score is designed to be the single number you check before committing GPU hours to training. The component scores tell you where to focus when that number isn't where you need it to be.