Summary
Identifying sitters in historical paintings is a key task for art historians, offering insight into their lives and how they chose to be seen. However, the process is often subjective and is hampered by scarce data and wide stylistic variation. Automated facial recognition can handle challenging conditions and could assist, but while traditional facial recognition models perform well on photographs, they struggle with paintings due to domain shift and high intra-class variation. Artistic factors such as style, skill, intent, and influence from other works further complicate recognition.
The early work of Srinivasan et al. approached the problem using local and anthropometric facial features to statistically assess similarity between portrait pairs. More recent approaches rely heavily on deep learning, particularly convolutional neural networks (CNNs). Style transfer and uncertainty estimation have been used to improve performance, but challenges remain.
In this work, we investigate the potential of foundation models to improve facial recognition in artworks. By fine-tuning foundation models and integrating their embeddings with those from conventional facial recognition networks, we demonstrate notable improvements over current state-of-the-art methods. Our results show that foundation models can bridge the gap where traditional methods are ineffective.
Proposed Method
CLIP Foundation Model: CLIP is a vision-language foundation model trained on 400 million image-text pairs. It learns to align images and text in a shared embedding space, enabling zero-shot transfer to a wide range of tasks. Since foundation models are trained on diverse data, they capture broad concepts, detect stylistic differences, and can use contextual information relevant to paintings that traditional facial recognition overlooks. For our experiments, we use the CLIP ViT-B/16 model from OpenAI, selected for its balance between tunability and computational efficiency.
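As a concrete illustration, the sketch below extracts image embeddings from the OpenAI CLIP ViT-B/16 checkpoint via the Hugging Face transformers library; the checkpoint name, preprocessing, and file name are illustrative assumptions, not our exact pipeline.

```python
# Minimal sketch: CLIP ViT-B/16 image embeddings via Hugging Face transformers.
# Checkpoint name and input file are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")
model.eval()

image = Image.open("portrait.jpg").convert("RGB")   # hypothetical input file
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    embedding = model.get_image_features(**inputs)                 # shape: (1, 512)
    embedding = embedding / embedding.norm(dim=-1, keepdim=True)   # L2-normalise
```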
Face Recognition Model: AntelopeV2 is a state-of-the-art face recognition (FR) model built on the IResNet100 architecture and trained on the large-scale Glint360k dataset. It achieves high accuracy and robustness across challenging variations in pose, age, and lighting, making it a strong and widely adopted baseline in face verification benchmarks. We also use a commercial off-the-shelf (COTS) face recognition model as another baseline.
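For reference, AntelopeV2 embeddings can be obtained through the insightface package roughly as sketched below; the model-pack name, detection size, and file name follow that library's conventions and should be treated as assumptions if your installation differs.

```python
# Minimal sketch: 512-d face embeddings from the AntelopeV2 model pack
# using the insightface package. Parameters are illustrative.
import cv2
from insightface.app import FaceAnalysis

app = FaceAnalysis(name="antelopev2")       # load the AntelopeV2 model pack
app.prepare(ctx_id=0, det_size=(640, 640))  # ctx_id=0 -> first GPU, -1 -> CPU

img = cv2.imread("portrait.jpg")            # hypothetical input file
faces = app.get(img)                        # detect + align + embed
if faces:
    embedding = faces[0].normed_embedding   # L2-normalised 512-d vector
```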
Model Adaptation: Both the CLIP and face recognition (FR) models are adapted using portrait images to better align with the domain of historical paintings. For CLIP, we apply Low-Rank Adaptation (LoRA), inserting LoRA layers into the query (Q) and value (V) matrices of the attention mechanism. This approach introduces low-rank updates, allowing efficient fine-tuning with significantly fewer trainable parameters and reduced memory usage, while maintaining performance. For the IResNet100 model, adaptation is minimal; only the final linear layer is fine-tuned to adjust the embeddings for the new domain.
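A minimal sketch of the two adaptation strategies is given below, using the PEFT library to inject LoRA adapters into the query and value projections and freezing all but the final linear layer of the FR backbone. The LoRA rank, scaling, dropout, and the layer name of the FR head are placeholder assumptions, not our exact hyperparameters.

```python
# Minimal sketch of the two adaptation strategies. Hyperparameters and module
# names are illustrative, not the exact values used in our experiments.
import torch.nn as nn
from transformers import CLIPModel
from peft import LoraConfig, get_peft_model

# --- CLIP: LoRA adapters on the attention query (Q) and value (V) projections ---
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.1,   # placeholder hyperparameters
    target_modules=["q_proj", "v_proj"],    # inject into Q and V matrices only
)
clip_lora = get_peft_model(clip, lora_cfg)  # only the LoRA parameters are trainable

# --- IResNet100: fine-tune only the final linear layer ---
def freeze_all_but_last_linear(backbone: nn.Module, last_linear_name: str = "fc"):
    """Freeze every parameter except the named final linear layer.
    The layer name "fc" is an assumption about the backbone implementation."""
    for name, param in backbone.named_parameters():
        param.requires_grad = name.startswith(last_linear_name)
```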
Fusion: Embeddings from the tuned and untuned IResNet100 models, along with the tuned CLIP model, are first individually normalised, then concatenated and re-normalised. This process effectively performs feature fusion, combining complementary information from each model.
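The fusion step reduces to a few lines; a minimal sketch in PyTorch follows, where the embedding variables are assumed to be vectors produced by the three models above (names are illustrative).

```python
# Minimal sketch of the feature fusion: L2-normalise each embedding,
# concatenate, then re-normalise the joint vector.
import torch
import torch.nn.functional as F

def fuse(*embeddings: torch.Tensor) -> torch.Tensor:
    """Normalise each embedding, concatenate, and re-normalise the result."""
    normed = [F.normalize(e, dim=-1) for e in embeddings]
    fused = torch.cat(normed, dim=-1)
    return F.normalize(fused, dim=-1)

# Usage: cosine similarity between two fused representations.
# ref   = fuse(irn_base_ref, irn_tuned_ref, clip_lora_ref)
# probe = fuse(irn_base_probe, irn_tuned_probe, clip_lora_probe)
# score = torch.dot(ref, probe)   # cosine similarity, since both are unit-norm
```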
Results
Results with IResNet100 Tuning: Tuning IResNet100 primarily improves TAR at low FAR values (<1%), with only a modest gain in EER. Fusion of the base and tuned models trails the tuned model in TAR at low FAR, but it matches its EER and achieves the best TAR@1%FAR, outperforming both individual models overall (a sketch of how EER and TAR@FAR are computed from raw scores follows the table).
Performance of IResNet100 variants with and without tuning and fusion.

| Model                          | EER↓  | TAR@0.1%FAR↑ | TAR@1%FAR↑ |
|--------------------------------|-------|--------------|------------|
| IResNet100-Base                | 14.0% | 29.9%        | 55.1%      |
| IResNet100-Tuned               | 13.5% | 36.5%        | 53.7%      |
| IResNet100-Fusion (Base+Tuned) | 13.5% | 33.6%        | 58.4%      |
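As referenced above, the sketch below shows how TAR at a fixed FAR and the Equal Error Rate (EER) can be computed from genuine and impostor similarity scores; it is a standard formulation with numpy and scikit-learn, not necessarily our exact evaluation code.

```python
# Minimal sketch of the evaluation metrics used in the tables, computed from
# genuine and impostor cosine-similarity scores.
import numpy as np
from sklearn.metrics import roc_curve

def tar_at_far(genuine: np.ndarray, impostor: np.ndarray, far: float) -> float:
    """TAR at the threshold where the impostor scores give the requested FAR."""
    threshold = np.quantile(impostor, 1.0 - far)   # top `far` fraction accepted
    return float(np.mean(genuine >= threshold))

def eer(genuine: np.ndarray, impostor: np.ndarray) -> float:
    """Equal Error Rate: the operating point where FAR equals FRR (1 - TAR)."""
    labels = np.concatenate([np.ones_like(genuine), np.zeros_like(impostor)])
    scores = np.concatenate([genuine, impostor])
    fpr, tpr, _ = roc_curve(labels, scores)
    frr = 1.0 - tpr
    idx = np.argmin(np.abs(fpr - frr))
    return float((fpr[idx] + frr[idx]) / 2.0)
```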
Results with CLIP-LoRA Tuning: We fine-tune CLIP with LoRA using a triplet loss, both with and without hard negative mining. The hard-negative-mining variant does not achieve higher TAR at every FAR level, but it improves performance above approximately 1% FAR and also yields a better EER (a sketch of the triplet objective with hard negative mining follows the table).
Performance of loss functions for tuning CLIP. The variant with Hard Negative Mining (HN) performs better overall.

| Model                       | EER↓  | TAR@0.1%FAR↑ | TAR@1%FAR↑ |
|-----------------------------|-------|--------------|------------|
| CLIP-Base                   | 17.9% | 8.4%         | 33.2%      |
| CLIP-LoRA (Triplet, HN)     | 13.1% | 17.8%        | 43.5%      |
| CLIP-LoRA (Triplet, w/o HN) | 13.9% | 16.8%        | 43.9%      |
ROC curves of tuned and base CLIP, IResNet100, COTS and proposed fusion method. Fusion provides consistent improvements even at low FAR.
Results with Fusion: We evaluate fusion of the individual IResNet100 and CLIP models on the Historical Faces test split. Although the base CLIP model underperforms IResNet100, fusing the two already improves the EER. Fine-tuning CLIP with LoRA further enhances performance across all reported metrics, with additional gains when the tuned IResNet100 is included in the fusion.
Performance comparison of base and tuned models, fusion variants, and the COTS FR system. Fusion enhances overall accuracy.

| Model                                          | EER↓  | TAR@0.1%FAR↑ | TAR@1%FAR↑ |
|------------------------------------------------|-------|--------------|------------|
| COTS FR system                                 | 12.6% | 34.3%        | 58.1%      |
| CLIP-Base                                      | 17.9% | 8.4%         | 33.2%      |
| IResNet100-Base                                | 14.0% | 29.9%        | 55.1%      |
| CLIP-Base + IResNet100-Base                    | 13.1% | 29.0%        | 54.7%      |
| CLIP-Base + IResNet100-Tuned                   | 12.6% | 35.1%        | 57.9%      |
| CLIP-LoRA + IResNet100-Base                    | 11.1% | 34.6%        | 62.6%      |
| CLIP-LoRA + IResNet100-Tuned                   | 10.7% | 39.7%        | 62.15%     |
| CLIP-LoRA + IResNet100-Base + IResNet100-Tuned | 9.9%  | 39.7%        | 65.9%      |
Discussions: Tuning IResNet100 on the Historical Faces dataset yields only modest gains, but fusing the tuned and untuned versions outperforms either alone, suggesting that combining representations can help even within a single architecture. For CLIP-LoRA tuning, triplet loss with hard negative mining performs best, improving significantly over base CLIP. The fusion experiments show that combining foundation models with conventional face recognition networks improves performance across all evaluation metrics, especially in low false acceptance rate (FAR) scenarios. This indicates that foundation models capture valuable information from portraits that conventional networks may overlook.
Examples of successful and failed comparisons. Each pair shows the reference and probe images, with the associated cosine similarity score. High similarity scores indicate successful matches, while low scores represent mismatches.
Visualisations: To better understand both successful and failed matches, we show sample image pairs along with their similarity scores. Genuine comparisons can fail in challenging cases, particularly when there are significant differences in artistic style or medium, or in sitter age; failures are often compounded when multiple factors overlap. Conversely, some impostor comparisons yield surprisingly high similarity scores. This typically occurs when artists base their work on earlier portraits, or when sitters resemble each other and are depicted in a similar style or by the same artist.
Conclusions
In this work, we show that lightweight tuning of vision-language foundation models, combined with domain-adapted face recognition networks, can effectively bridge the domain gap between photographs and paintings. Our fusion approach achieves state-of-the-art accuracy in sitter identification. Face recognition on artworks remains a particularly difficult task compared to traditional FR due to the scarcity of labelled data, stylistic variation, and the interpretive nature of portraiture. However, the results show that adapting modern architectures to this setting is feasible and promising. This opens up new research avenues, including synthetic data generation to augment the limited training set and heterogeneous domain adaptation techniques to improve generalisation across visual domains.
Reproducibility: Source Code and Data
The source code for our experiments is publicly available. The work can be cited as:
@article{poh2025artface,
title={ArtFace: Towards Historical Portrait Face Identification via Model Adaptation},
author={Poh, Francois and George, Anjith and Marcel, S{\'e}bastien},
journal={arXiv preprint arXiv:2508.20626},
year={2025}
}