To evaluate MLLMs for face verification, we provide the MLLM with two face images and a text prompt asking it to compare the given images and return a similarity score.
We use the output of the MLLM, normalised to [0, 1], as the similarity score for face verification.
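Since the model replies in free text, the similarity score has to be extracted and normalised before scoring. A minimal sketch of this step (the regex-based parsing, the 0–100 prompt scale, and the fallback value are our assumptions, not the paper's exact protocol):

```python
import re

def parse_similarity(reply: str, scale: float = 100.0) -> float:
    """Extract the first number from the MLLM's free-text reply and
    normalise it to [0, 1].

    `scale` is an assumption: here the prompt is presumed to ask for a
    score out of 100. Unparsable replies fall back to 0.0 (non-match).
    """
    m = re.search(r"-?\d+(?:\.\d+)?", reply)
    if m is None:
        return 0.0  # no parsable score in the reply: treat as non-match
    score = float(m.group()) / scale
    return min(max(score, 0.0), 1.0)  # clamp to the [0, 1] range
```

For example, a reply such as "The similarity score is 87." would be mapped to 0.87 under these assumptions.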
We evaluate the demographic fairness of MLLMs on two face verification benchmarks: the IJB-C and RFW datasets. The following table reports global and per-group EER together with TMR at three fixed FMR thresholds for both benchmarks. The DET curves in Fig. 2 visualise the full operating characteristic for every model.
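The EER and TMR@FMR values above can be computed directly from the genuine and impostor score lists. A minimal sketch using an empirical threshold sweep (the exact thresholding and interpolation used in the paper may differ):

```python
import numpy as np

def eer_and_tmr(genuine, impostor, fmr_targets=(0.10, 0.01, 0.001)):
    """Compute EER and TMR at fixed FMR operating points from raw
    similarity scores (higher score = more similar)."""
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    thresholds = np.unique(np.concatenate([genuine, impostor]))
    # FMR: fraction of impostor pairs accepted at each threshold;
    # FNMR: fraction of genuine pairs rejected at each threshold.
    fmr = np.array([(impostor >= t).mean() for t in thresholds])
    fnmr = np.array([(genuine < t).mean() for t in thresholds])
    # EER: operating point where FMR and FNMR are (closest to) equal.
    eer_idx = np.argmin(np.abs(fmr - fnmr))
    eer = (fmr[eer_idx] + fnmr[eer_idx]) / 2
    # TMR = 1 - FNMR at the threshold whose FMR is closest to the target.
    tmr = {f: 1.0 - fnmr[np.argmin(np.abs(fmr - f))] for f in fmr_targets}
    return eer, tmr
```

Per-group metrics follow by restricting the genuine and impostor lists to pairs from a single demographic group.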
The following table reports four FMR-based fairness metrics, evaluated at the EER threshold and at three fixed operating points (FMR = 10%, 1%, and 0.1%), together with the mean decidability index:
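The decidability index is commonly defined in biometrics as the Fisher ratio d' between the genuine and impostor score distributions; a sketch under that assumption (the paper's exact definition may vary):

```python
import numpy as np

def decidability(genuine, impostor):
    """Decidability index d' = |mu_g - mu_i| / sqrt((var_g + var_i) / 2).

    Larger d' means better-separated genuine and impostor score
    distributions. This is the standard biometric d'; whether the paper
    uses exactly this form is an assumption.
    """
    g = np.asarray(genuine, dtype=float)
    i = np.asarray(impostor, dtype=float)
    return abs(g.mean() - i.mean()) / np.sqrt((g.var() + i.var()) / 2)
```

The "mean decidability index" in the table would then be the average of d' over demographic groups.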
The following figure shows the genuine and impostor score distributions for each demographic group. FaceLLM-8B shows the clearest separation between the two distributions on both benchmarks, consistent with its low EER. Ovis1.5 and Qwen2-VL-2B, on the other hand, have heavily overlapping genuine and impostor distributions, which explains their near-chance accuracy.
[Source Code] The source code of our experiments is publicly available: https://github.com/idiap/mllm-fairness
@article{mllm_fairness_2026,
author = {{\"U}nsal {\"O}zt{\"u}rk and Hatef Otroshi Shahreza and S{\'e}bastien Marcel},
title = {Demographic Fairness in Multimodal LLMs: A Benchmark of Gender and Ethnicity Bias in Face Verification},
journal = {arXiv preprint arXiv:2603.25613},
year = {2026}
}