Summary
Identifying sitters in historical paintings is a key task for art historians, offering insight into their lives and how they chose to be seen. However, the process is often subjective and is hampered by scarce data and wide stylistic variation. Automated facial recognition can handle challenging conditions and could assist, but while traditional facial recognition models perform well on photographs, they struggle with paintings due to domain shift and high intra-class variation. Artistic factors such as style, skill, intent, and influence from other works further complicate recognition.
The early work of Srinivasan et al. approached the problem using local and anthropometric facial features to statistically assess similarity between portrait pairs. More recent approaches rely heavily on deep learning, particularly convolutional neural networks (CNNs). Style transfer and uncertainty estimation have been used to improve performance, but challenges remain.
In this work, we investigate the potential of foundation models to improve facial recognition in artworks. By fine-tuning foundation models and integrating their embeddings with those from conventional facial recognition networks, we demonstrate notable improvements over current state-of-the-art methods. Our results show that foundation models can bridge the gap where traditional methods are ineffective.
Proposed Method
CLIP Foundation Model: CLIP is a vision-language foundation model trained on 400 million image-text pairs. It learns to align images and text in a shared embedding space, enabling zero-shot transfer to a wide range of tasks. Since foundation models are trained on diverse data, they capture broad concepts, detect stylistic differences, and can use contextual information relevant to paintings that traditional facial recognition overlooks. For our experiments, we use the CLIP ViT-B/16 model from OpenAI, selected for its balance between tunability and computational efficiency.
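As a concrete illustration, the sketch below extracts image embeddings from the OpenAI CLIP ViT-B/16 checkpoint via the Hugging Face transformers library; the checkpoint name, preprocessing, and file name are illustrative assumptions, not our exact pipeline.

```python
# Minimal sketch: CLIP ViT-B/16 image embeddings via Hugging Face transformers.
# Checkpoint name and input file are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")
model.eval()

image = Image.open("portrait.jpg").convert("RGB")   # hypothetical input file
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    embedding = model.get_image_features(**inputs)                 # shape: (1, 512)
    embedding = embedding / embedding.norm(dim=-1, keepdim=True)   # L2-normalise
```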
Face Recognition Model: AntelopeV2 is a state-of-the-art face recognition (FR) model built on the IResNet100 architecture and trained on the large-scale Glint360k dataset. It achieves high accuracy and robustness across challenging variations in pose, age, and lighting, making it a strong and widely adopted baseline in face verification benchmarks. We also use a commercial off-the-shelf (COTS) face recognition model as another baseline.
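For reference, AntelopeV2 embeddings can be obtained through the insightface package roughly as sketched below; the model-pack name, detection size, and file name follow that library's conventions and should be treated as assumptions if your installation differs.

```python
# Minimal sketch: 512-d face embeddings from the AntelopeV2 model pack
# using the insightface package. Parameters are illustrative.
import cv2
from insightface.app import FaceAnalysis

app = FaceAnalysis(name="antelopev2")       # load the AntelopeV2 model pack
app.prepare(ctx_id=0, det_size=(640, 640))  # ctx_id=0 -> first GPU, -1 -> CPU

img = cv2.imread("portrait.jpg")            # hypothetical input file
faces = app.get(img)                        # detect + align + embed
if faces:
    embedding = faces[0].normed_embedding   # L2-normalised 512-d vector
```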
Model Adaptation: Both the CLIP and face recognition (FR) models are adapted using portrait images to better align with the domain of historical paintings. For CLIP, we apply Low-Rank Adaptation (LoRA), inserting LoRA layers into the query (Q) and value (V) matrices of the attention mechanism. This approach introduces low-rank updates, allowing efficient fine-tuning with significantly fewer trainable parameters and reduced memory usage, while maintaining performance. For the IResNet100 model, adaptation is minimal; only the final linear layer is fine-tuned to adjust the embeddings for the new domain.
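A minimal sketch of the two adaptation strategies is given below, using the PEFT library to inject LoRA adapters into the query and value projections and freezing all but the final linear layer of the FR backbone. The LoRA rank, scaling, dropout, and the layer name of the FR head are placeholder assumptions, not our exact hyperparameters.

```python
# Minimal sketch of the two adaptation strategies. Hyperparameters and module
# names are illustrative, not the exact values used in our experiments.
import torch.nn as nn
from transformers import CLIPModel
from peft import LoraConfig, get_peft_model

# --- CLIP: LoRA adapters on the attention query (Q) and value (V) projections ---
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.1,   # placeholder hyperparameters
    target_modules=["q_proj", "v_proj"],    # inject into Q and V matrices only
)
clip_lora = get_peft_model(clip, lora_cfg)  # only the LoRA parameters are trainable

# --- IResNet100: fine-tune only the final linear layer ---
def freeze_all_but_last_linear(backbone: nn.Module, last_linear_name: str = "fc"):
    """Freeze every parameter except the named final linear layer.
    The layer name "fc" is an assumption about the backbone implementation."""
    for name, param in backbone.named_parameters():
        param.requires_grad = name.startswith(last_linear_name)
```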
Fusion: Embeddings from the tuned and untuned IResNet100 models, along with the tuned CLIP model, are first individually normalised, then concatenated and re-normalised. This process effectively performs feature fusion, combining complementary information from each model.
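The fusion step reduces to a few lines; a minimal sketch in PyTorch follows, where the embedding variables are assumed to be vectors produced by the three models above (names are illustrative).

```python
# Minimal sketch of the feature fusion: L2-normalise each embedding,
# concatenate, then re-normalise the joint vector.
import torch
import torch.nn.functional as F

def fuse(*embeddings: torch.Tensor) -> torch.Tensor:
    """Normalise each embedding, concatenate, and re-normalise the result."""
    normed = [F.normalize(e, dim=-1) for e in embeddings]
    fused = torch.cat(normed, dim=-1)
    return F.normalize(fused, dim=-1)

# Usage: cosine similarity between two fused representations.
# ref   = fuse(irn_base_ref, irn_tuned_ref, clip_lora_ref)
# probe = fuse(irn_base_probe, irn_tuned_probe, clip_lora_probe)
# score = torch.dot(ref, probe)   # cosine similarity, since both are unit-norm
```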
Results
Results with IResNet100 Tuning: Tuning IResNet100 primarily improves TAR at low FAR values (<1%), with only a modest gain in EER. Fusion of the base and tuned models trails the tuned model in TAR at low FAR, but it matches its EER and achieves the best TAR@1%FAR, outperforming both individual models overall (a sketch of how EER and TAR@FAR are computed from raw scores follows the table).
Performance of IResNet100 variants with and without tuning and fusion.

| Model                          | EER↓  | TAR@0.1%FAR↑ | TAR@1%FAR↑ |
|--------------------------------|-------|--------------|------------|
| IResNet100-Base                | 14.0% | 29.9%        | 55.1%      |
| IResNet100-Tuned               | 13.5% | 36.5%        | 53.7%      |
| IResNet100-Fusion (Base+Tuned) | 13.5% | 33.6%        | 58.4%      |
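As referenced above, the sketch below shows how TAR at a fixed FAR and the Equal Error Rate (EER) can be computed from genuine and impostor similarity scores; it is a standard formulation with numpy and scikit-learn, not necessarily our exact evaluation code.

```python
# Minimal sketch of the evaluation metrics used in the tables, computed from
# genuine and impostor cosine-similarity scores.
import numpy as np
from sklearn.metrics import roc_curve

def tar_at_far(genuine: np.ndarray, impostor: np.ndarray, far: float) -> float:
    """TAR at the threshold where the impostor scores give the requested FAR."""
    threshold = np.quantile(impostor, 1.0 - far)   # top `far` fraction accepted
    return float(np.mean(genuine >= threshold))

def eer(genuine: np.ndarray, impostor: np.ndarray) -> float:
    """Equal Error Rate: the operating point where FAR equals FRR (1 - TAR)."""
    labels = np.concatenate([np.ones_like(genuine), np.zeros_like(impostor)])
    scores = np.concatenate([genuine, impostor])
    fpr, tpr, _ = roc_curve(labels, scores)
    frr = 1.0 - tpr
    idx = np.argmin(np.abs(fpr - frr))
    return float((fpr[idx] + frr[idx]) / 2.0)
```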
Results with CLIP-LoRA Tuning: We fine-tune CLIP with LoRA using a triplet loss, both with and without hard negative mining. The hard-negative-mining variant does not achieve higher TAR at every FAR level, but it improves performance above approximately 1% FAR and also yields a better EER (a sketch of the triplet objective with hard negative mining follows the table).
Performance of loss functions for tuning CLIP. The variant with Hard Negative Mining (HN) performs better overall.

| Model                       | EER↓  | TAR@0.1%FAR↑ | TAR@1%FAR↑ |
|-----------------------------|-------|--------------|------------|
| CLIP-Base                   | 17.9% | 8.4%         | 33.2%      |
| CLIP-LoRA (Triplet, HN)     | 13.1% | 17.8%        | 43.5%      |
| CLIP-LoRA (Triplet, w/o HN) | 13.9% | 16.8%        | 43.9%      |
ROC curves of tuned and base CLIP, IResNet100, COTS and proposed fusion method. Fusion provides consistent improvements even at low FAR.
Results with Fusion: We evaluate fusion of the individual IResNet100 and CLIP models on the Historical Faces test split. Although the base CLIP model underperforms IResNet100, fusing the two already improves the EER. Fine-tuning CLIP with LoRA further enhances performance across all reported metrics, with additional gains when the tuned IResNet100 is included in the fusion.
Performance comparison of base and tuned models, fusion variants, and the COTS FR system. Fusion enhances overall accuracy.

| Model                                          | EER↓  | TAR@0.1%FAR↑ | TAR@1%FAR↑ |
|------------------------------------------------|-------|--------------|------------|
| COTS FR system                                 | 12.6% | 34.3%        | 58.1%      |
| CLIP-Base                                      | 17.9% | 8.4%         | 33.2%      |
| IResNet100-Base                                | 14.0% | 29.9%        | 55.1%      |
| CLIP-Base + IResNet100-Base                    | 13.1% | 29.0%        | 54.7%      |
| CLIP-Base + IResNet100-Tuned                   | 12.6% | 35.1%        | 57.9%      |
| CLIP-LoRA + IResNet100-Base                    | 11.1% | 34.6%        | 62.6%      |
| CLIP-LoRA + IResNet100-Tuned                   | 10.7% | 39.7%        | 62.15%     |
| CLIP-LoRA + IResNet100-Base + IResNet100-Tuned | 9.9%  | 39.7%        | 65.9%      |
Discussions: Tuning IResNet100 on the Historical Faces dataset yields only modest gains, but fusing the tuned and untuned versions outperforms either alone, suggesting that combining representations can help even within a single architecture. For CLIP-LoRA tuning, triplet loss with hard negative mining performs best, improving significantly over base CLIP. The fusion experiments show that combining foundation models with conventional face recognition networks improves performance across all evaluation metrics, especially in low false acceptance rate (FAR) scenarios. This indicates that foundation models capture valuable information from portraits that conventional networks may overlook.
Examples of successful and failed comparisons. Each pair shows the reference and probe images, with the associated cosine similarity score. High similarity scores indicate successful matches, while low scores represent mismatches.
Visualisations: To better understand both successful and failed matches, we show sample image pairs along with their similarity scores. Genuine comparisons can fail in challenging cases, particularly when there are significant differences in artistic style or medium, or in sitter age; failures are often compounded when multiple factors overlap. Conversely, some impostor comparisons yield surprisingly high similarity scores. This typically occurs when artists base their work on earlier portraits, or when sitters resemble each other and are depicted in a similar style or by the same artist.
Conclusions
In this work, we show that lightweight tuning of vision-language foundation models, combined with domain-adapted face recognition networks, can effectively bridge the domain gap between photographs and paintings. Our fusion approach achieves state-of-the-art accuracy in sitter identification. Face recognition on artworks remains a particularly difficult task compared to traditional FR due to the scarcity of labelled data, stylistic variation, and the interpretive nature of portraiture. However, the results show that adapting modern architectures to this setting is feasible and promising. This opens up new research avenues, including synthetic data generation to augment the limited training set and heterogeneous domain adaptation techniques to improve generalisation across visual domains.
Reproducibility: Source Code and Data
The source code for our experiments is publicly available. The work can be cited as:
@article{poh2025artface,
title={ArtFace: Towards Historical Portrait Face Identification via Model Adaptation},
author={Poh, Francois and George, Anjith and Marcel, S{\'e}bastien},
journal={arXiv preprint arXiv:2508.20626},
year={2025}
}