VGGT-HPE: Reframing Head Pose Estimation as Relative Pose Prediction

1Archimedes, Athena Research Center, Marousi, Greece     2HERON, Hellenic Robotics Center of Excellence, Athens, Greece     3Robotics Institute, Athena Research Center, Marousi, Greece
4School of ECE, National Technical University of Athens, Greece     5University of Pennsylvania
Figure 1 teaser for VGGT-HPE

Abstract

Monocular head pose estimation is traditionally formulated as direct regression from a single image to an absolute pose. This paradigm forces the network to implicitly internalize a dataset-specific canonical reference frame. In this work, we argue that predicting the relative rigid transformation between two observed head configurations is a fundamentally easier and more robust formulation. We introduce VGGT-HPE, a relative head pose estimator built upon a general-purpose geometry foundation model. Fine-tuned exclusively on synthetic facial renderings, our method sidesteps the need for an implicit anchor by reducing the problem to estimating a geometric displacement from an explicitly provided anchor with a known pose. As a practical benefit, the relative formulation also allows the anchor to be chosen at test time, for instance a near-neutral frame or a temporally adjacent one, so that prediction difficulty can be controlled by the application. Despite using no real-world training data, VGGT-HPE achieves state-of-the-art results on the BIWI benchmark, outperforming established absolute regression methods trained on mixed and real datasets. Through controlled easy- and hard-pair benchmarks, we also systematically validate our core hypothesis: relative prediction is intrinsically more accurate than absolute regression, with the advantage growing with the difficulty of the target pose.

Motivation

  • 🔥 The model no longer needs to guess the hidden reference frame.
  • 🔥 It can perform implicit feature matching between the two visible head states.
  • 🔥 The anchor can be chosen at test time, so you can control difficulty by picking a near-neutral or temporally close frame.

Methodology

Our goal is to estimate the pose of a query head image by predicting its rigid transformation relative to an anchor image with known pose.

Figure 2 method overview for VGGT-HPE
1. Input pair: anchor image + query image.
2. Backbone: the VGGT camera branch.
3. Adaptation and output: LoRA fine-tuning with the backbone mostly frozen; the output is the relative transform Tq←a.
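Given the predicted relative transform Tq←a and the known anchor pose, the absolute query pose follows by composition. A minimal NumPy sketch under an assumed left-composition convention (function names are illustrative, not from the paper's code):

```python
import numpy as np

def make_pose(R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Assemble a 4x4 homogeneous rigid transform from rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def compose_pose(T_rel: np.ndarray, T_anchor: np.ndarray) -> np.ndarray:
    """Recover the absolute query pose from the predicted relative transform
    T_rel = T_{q<-a} and the known anchor pose. Whether the relative transform
    left- or right-composes depends on the frame convention; left-composition
    is assumed here."""
    return T_rel @ T_anchor

def yaw(deg: float) -> np.ndarray:
    """Rotation about the vertical (yaw) axis, in degrees."""
    a = np.deg2rad(deg)
    return np.array([[np.cos(a), 0.0, np.sin(a)],
                     [0.0, 1.0, 0.0],
                     [-np.sin(a), 0.0, np.cos(a)]])

# Example: anchor at 10 deg yaw, predicted relative rotation of 20 deg
# about the same axis, so the composed query pose sits at 30 deg yaw.
T_a = make_pose(yaw(10.0), np.zeros(3))
T_rel = make_pose(yaw(20.0), np.zeros(3))
T_q = compose_pose(T_rel, T_a)
```

Because the anchor pose is known exactly at composition time, any error in the final query pose comes only from the predicted relative transform.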

Results

VGGT-HPE achieves the lowest yaw and pitch errors among all methods, while being the only method trained exclusively on synthetic data.

Qualitative Results

Figure 4 qualitative BIWI results for VGGT-HPE

Qualitative results on BIWI. Each row shows a different subject. From left to right: the query frame, the anchor frame with its known pose overlay, the ground-truth pose, our prediction (VGGT-HPE), and three baselines (6DRepNet, TokenHPE, TRG).

Main BIWI Results

Table 1 presents the main cross-domain evaluation, split into reported numbers and reproduced results under a shared MTCNN protocol.

Reported numbers:

| Method | Yaw ↓ | Pitch ↓ | Roll ↓ | MAE ↓ | Data |
|---|---|---|---|---|---|
| Dlib | 11.86 | 13.00 | 19.56 | 14.81 | R |
| 3DDFA | 5.50 | 41.90 | 13.22 | 19.07 | M |
| EVA-GCN | 4.01 | 4.78 | 2.98 | 3.92 | M |
| HopeNet | 4.81 | 6.61 | 3.27 | 4.89 | M |
| QuatNet | 4.01 | 5.49 | 2.94 | 4.15 | M |
| Liu et al. | 4.12 | 5.61 | 3.15 | 4.29 | M |
| FSA-Net | 4.27 | 4.96 | 2.76 | 4.00 | M |
| HPE | 4.57 | 5.18 | 3.12 | 4.29 | M |
| WHENet-V | 3.60 | 4.10 | 2.73 | 3.48 | M |
| RetinaFace | 4.07 | 6.42 | 2.97 | 4.49 | R |
| FDN | 4.52 | 4.70 | 2.56 | 3.93 | M |
| MNN | 3.98 | 4.61 | 2.39 | 3.66 | M |
| TriNet | 3.05 | 4.76 | 4.11 | 3.97 | M |
| 6DRepNet | 3.24 | 4.48 | 2.68 | 3.47 | M |
| Cao et al. | 4.21 | 3.52 | 3.10 | 3.61 | M |
| TokenHPE | 3.95 | 4.51 | 2.71 | 3.72 | M |
| Cobo et al. | 4.58 | 4.65 | 2.71 | 3.98 | M |
| img2pose | 4.57 | 3.55 | 3.24 | 3.79 | M |
| PerspNet | 3.10 | 3.37 | 2.38 | 2.95 | R |
| TRG | 3.04 | 3.44 | 1.78 | 2.75 | M |
| VGGT-HPE (Rel., ours) | 2.24 | 3.04 | 3.17 | 2.82 | S |

Reproduced under shared MTCNN protocol:

| Method | Yaw ↓ | Pitch ↓ | Roll ↓ | MAE ↓ | Data |
|---|---|---|---|---|---|
| 6DRepNet | 3.74 | 4.95 | 3.04 | 3.91 | M |
| TokenHPE-v1 | 5.57 | 6.23 | 3.79 | 5.20 | M |
| TRG | 4.58 | 7.18 | 3.68 | 5.15 | M |
| VGGT-HPE-Abs (ours) | 4.90 | 7.01 | 3.53 | 5.15 | S |
| VGGT-HPE (Rel., ours) | 2.24 | 3.04 | 3.17 | 2.82 | S |

R = real, M = mixed, S = synthetic.
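The per-axis numbers in Table 1 are mean absolute errors in degrees over yaw, pitch, and roll. A minimal sketch of the metric; the explicit angle wrap-around step is our assumption, as evaluation protocols differ on how they handle it:

```python
import numpy as np

def euler_mae(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Per-axis mean absolute error in degrees for (yaw, pitch, roll)
    arrays of shape (N, 3). Differences are wrapped into [-180, 180]
    so that e.g. 179 vs -179 counts as a 2-degree error."""
    diff = pred - gt
    diff = (diff + 180.0) % 360.0 - 180.0
    err = np.abs(diff).mean(axis=0)
    return {"yaw": err[0], "pitch": err[1], "roll": err[2], "mae": err.mean()}

# Toy example with two frames of (yaw, pitch, roll) predictions.
pred = np.array([[10.0, 5.0, 0.0], [-30.0, 2.0, 1.0]])
gt = np.array([[12.0, 4.0, 0.0], [-28.0, 0.0, 0.0]])
scores = euler_mae(pred, gt)  # yaw 2.0, pitch 1.5, roll 0.5
```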

Controlled Benchmarks: Easy vs Hard Pairs

To study how prediction difficulty scales with the anchor-target rotation gap, we construct two complementary benchmarks from BIWI.
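A natural difficulty measure for a pair is the geodesic distance between the anchor and query rotations. A sketch of how such pairs could be bucketed; the 5° and 45° thresholds here are illustrative, not the paper's exact cutoffs:

```python
import numpy as np

def geodesic_deg(R1: np.ndarray, R2: np.ndarray) -> float:
    """Geodesic distance in degrees between two rotation matrices:
    the rotation angle of the relative rotation R1^T R2."""
    cos = (np.trace(R1.T @ R2) - 1.0) / 2.0
    cos = np.clip(cos, -1.0, 1.0)  # guard against numerical drift
    return float(np.degrees(np.arccos(cos)))

def split_pairs(pairs, thresh_easy=5.0, thresh_hard=45.0):
    """Bucket (R_anchor, R_query) pairs into easy/hard by rotation gap."""
    easy = [p for p in pairs if geodesic_deg(*p) < thresh_easy]
    hard = [p for p in pairs if geodesic_deg(*p) > thresh_hard]
    return easy, hard

# Example: identity vs a 60-degree yaw rotation.
a = np.deg2rad(60.0)
R_id = np.eye(3)
R_60 = np.array([[np.cos(a), 0.0, np.sin(a)],
                 [0.0, 1.0, 0.0],
                 [-np.sin(a), 0.0, np.cos(a)]])
gap = geodesic_deg(R_id, R_60)  # 60.0
```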

Hard benchmark

| Method | Yaw ↓ | Pitch ↓ | Roll ↓ | MAE ↓ | Data |
|---|---|---|---|---|---|
| VGGT-HPE-Abs | 40.74 | 18.65 | 33.66 | 31.02 | S |
| TokenHPE | 21.85 | 26.35 | 19.34 | 22.51 | M |
| TRG | 8.95 | 33.88 | 8.87 | 17.23 | M |
| 6DRepNet | 14.27 | 18.91 | 6.81 | 13.33 | M |
| VGGT-HPE (Rel., ours) | 3.81 | 15.87 | 6.93 | 8.87 | S |

BIWI hard benchmark (360 neutral-anchor / extreme-query pairs).

Easy benchmark

| Method | Yaw ↓ | Pitch ↓ | Roll ↓ | MAE ↓ | Data |
|---|---|---|---|---|---|
| VGGT-HPE-Abs | 5.62 | 2.26 | 4.06 | 3.98 | S |
| TokenHPE | 3.41 | 5.99 | 1.43 | 3.61 | M |
| TRG | 3.59 | 4.24 | 2.52 | 3.45 | M |
| 6DRepNet | 2.31 | 4.93 | 1.14 | 2.80 | M |
| VGGT-HPE (Rel., ours) | 1.17 | 0.74 | 0.97 | 0.96 | S |

BIWI easy neutral-anchor benchmark (360 pairs; pair delta mean: 3.82°).

Error Analysis

Figure 5 BIWI neutral-anchor evaluation

Figure 5. BIWI neutral-anchor evaluation as a function of anchor-query rotation gap. The upper plot reports rotation MAE, while the lower band shows the number of sampled pairs per bin.

Figure 6 BIWI query-pose evaluation

Figure 6. BIWI query-pose evaluation as a function of absolute query pose. For VGGT-HPE, each query is paired with a same-subject anchor whose anchor-query geodesic gap is below 5°.

Ablation Studies

Full fine-tuning destroys the pretrained representations and performs worst on both datasets. LoRA strikes the best balance, preserving the pretrained geometric priors while adapting to the facial domain.
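The LoRA variant keeps the pretrained weights frozen and learns only a low-rank additive update. A minimal NumPy sketch of the idea; the rank, scaling, and choice of target layers here are illustrative, not the paper's exact configuration:

```python
import numpy as np

class LoRALinear:
    """Low-rank adaptation of a frozen linear layer:
    y = x @ (W + (alpha / r) * A @ B), with only A and B trainable.
    B is zero-initialized, so at initialization the adapter is a
    no-op and the layer reproduces the frozen pretrained output."""

    def __init__(self, W: np.ndarray, r: int = 8, alpha: float = 16.0, seed: int = 0):
        rng = np.random.default_rng(seed)
        d_in, d_out = W.shape
        self.W = W                                   # frozen pretrained weight
        self.A = rng.normal(0.0, 0.01, (d_in, r))    # trainable down-projection
        self.B = np.zeros((r, d_out))                # trainable up-projection
        self.scale = alpha / r

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return x @ self.W + self.scale * (x @ self.A) @ self.B

# Demo on random data: before any training step, the adapted layer
# matches the frozen layer exactly.
rng = np.random.default_rng(1)
W = rng.normal(size=(6, 4))
x = rng.normal(size=(2, 6))
layer = LoRALinear(W)
```

Because only A and B receive gradients, the pretrained geometric priors in W are preserved while a small number of parameters adapts the model to the facial domain.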

In each row, the first four metric columns are measured on the synthetic validation set and the last four on BIWI.

Adaptation strategy

| Variant | Yaw ↓ | Pitch ↓ | Roll ↓ | MAE ↓ | Yaw ↓ | Pitch ↓ | Roll ↓ | MAE ↓ |
|---|---|---|---|---|---|---|---|---|
| Full finetune | 38.00 | 33.76 | 31.12 | 34.29 | 23.07 | 17.90 | 10.05 | 17.00 |
| From scratch | 9.21 | 15.07 | 14.05 | 12.78 | 7.71 | 8.12 | 6.94 | 7.59 |
| Head-only | 3.82 | 6.50 | 5.89 | 5.40 | 18.08 | 15.17 | 8.25 | 13.83 |
| LoRA (ours) | 2.46 | 4.65 | 4.51 | 3.87 | 2.24 | 3.04 | 3.17 | 2.82 |

Loss and formulation variants (all LoRA)

| Variant | Yaw ↓ | Pitch ↓ | Roll ↓ | MAE ↓ | Yaw ↓ | Pitch ↓ | Roll ↓ | MAE ↓ |
|---|---|---|---|---|---|---|---|---|
| Small-Gap | 17.03 | 23.20 | 21.77 | 20.66 | 7.38 | 7.07 | 3.88 | 6.11 |
| Abs. Pair | 3.12 | 4.59 | 7.01 | 4.91 | 3.61 | 6.99 | 6.49 | 5.69 |
| Abs. Single | 2.44 | 4.03 | 3.24 | 3.24 | 4.90 | 7.01 | 3.53 | 5.15 |
| T-Aux, No FoV | 2.94 | 5.45 | 6.01 | 4.80 | 2.94 | 3.91 | 3.90 | 3.59 |
| No FoV | 2.39 | 4.13 | 4.19 | 3.57 | 2.64 | 4.05 | 3.61 | 3.43 |
| T-Aux | 2.43 | 4.37 | 4.48 | 3.76 | 2.76 | 3.10 | 3.90 | 3.25 |
| Geo Loss | 2.28 | 3.99 | 3.76 | 3.34 | 2.95 | 3.24 | 3.32 | 3.17 |
| Rot.-Only | 2.40 | 4.31 | 4.06 | 3.59 | 2.58 | 3.48 | 3.32 | 3.12 |
| Baseline | 2.19 | 4.15 | 3.96 | 3.43 | 2.61 | 3.17 | 3.31 | 3.03 |
| VGGT-HPE | 2.46 | 4.65 | 4.51 | 3.87 | 2.24 | 3.04 | 3.17 | 2.82 |

When the anchor pose is itself estimated by an external method rather than taken from ground truth, performance naturally degrades, but the drop is moderate when the anchor estimator is reasonably accurate.

Full BIWI with external anchor

| Method | Yaw ↓ | Pitch ↓ | Roll ↓ | MAE ↓ |
|---|---|---|---|---|
| VGGT-HPE (Rel., GT anchor) | 2.24 | 3.04 | 3.17 | 2.82 |
| VGGT-HPE (Rel., VGGT-HPE-Abs anchor) | 4.56 | 5.28 | 3.42 | 4.42 |
| VGGT-HPE (Rel., 6DRepNet anchor) | 2.89 | 6.05 | 3.29 | 4.08 |
| VGGT-HPE (Rel., TokenHPE anchor) | 3.35 | 7.30 | 3.91 | 4.85 |
| VGGT-HPE (Rel., TRG anchor) | 4.02 | 12.03 | 7.13 | 7.73 |

Hard BIWI subset with external anchor

| Method | Yaw ↓ | Pitch ↓ | Roll ↓ | MAE ↓ |
|---|---|---|---|---|
| VGGT-HPE (Rel., GT anchor) | 3.81 | 15.87 | 6.93 | 8.87 |
| VGGT-HPE (Rel., VGGT-HPE-Abs anchor) | 6.93 | 17.36 | 8.68 | 10.99 |
| VGGT-HPE (Rel., 6DRepNet anchor) | 4.81 | 19.36 | 6.62 | 10.26 |
| VGGT-HPE (Rel., TokenHPE anchor) | 5.92 | 20.31 | 7.44 | 11.22 |
| VGGT-HPE (Rel., TRG anchor) | 7.07 | 21.82 | 17.00 | 15.30 |

Citation

@inproceedings{vasileiou2026vggthpe,
  title={VGGT-HPE: Reframing Head Pose Estimation as Relative Pose Prediction},
  author={Vasileiou, Vasiliki and Filntisis, Panagiotis P. and Maragos, Petros and Daniilidis, Kostas},
  booktitle={CVPR Workshop},
  year={2026}
}