Detecting Text Manipulation in Images using Vision Language Models

1Idiap Research Institute, 2UNIL

VLM for Text Manipulation Detection: (a) Example of in-the-wild text manipulation (in red) from the OSTF dataset. (b) Given a user prompt, we ask a pretrained VLM whether the accompanying image contains a text manipulation; the VLM's output is then used as a label for binary classification. (c, d) Example from the FantasyID dataset, which simulates the real-world scenario of text manipulation in ID documents; the altered text is shown in red in (d).

Abstract

Recent works have shown the effectiveness of Large Vision Language Models (VLMs or LVLMs) in image manipulation detection. However, text manipulation detection is largely missing from these studies. We bridge this knowledge gap by analyzing closed- and open-source VLMs on different text manipulation datasets. Our results suggest that open-source models are closing the gap but still lag behind closed-source ones such as GPT-4o. Additionally, we benchmark VLMs specialized for image manipulation detection on text manipulation detection and show that they suffer from poor generalization. We benchmark VLMs on manipulations of in-the-wild scene text and of fantasy ID cards, the latter mimicking a challenging real-world misuse scenario.
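For concreteness, a minimal sketch of this detection pipeline is given below, assuming the OpenAI Python SDK and a base64-encoded input image; the prompt wording is illustrative and not the exact prompt used in the paper.

import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Illustrative prompt; the paper's exact wording may differ.
PROMPT = ("Does this image contain any digitally manipulated or altered text? "
          "Answer with 'yes' or 'no' only.")

def detect_text_manipulation(image_path: str) -> int:
    """Return 1 if the VLM flags the image as manipulated, else 0."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    resp = client.chat.completions.create(
        model="gpt-4o-2024-11-20",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    answer = resp.choices[0].message.content.strip().lower()
    return 1 if answer.startswith("yes") else 0  # binary label for evaluation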

Digital Manipulations

One challenge in detecting digital text manipulation is that the fraction of the image area occupied by the text can be quite small, as the following examples show.

Examples of text manipulation. Left: FantasyID. Right: OSTF.

Ablations

Left: a more detailed prompt description increases performance. Right: a higher input image resolution leads to better performance.

Prompt: As we make our prompts more detailed, detection performance increases. This is especially true for FantasyID, whose cards are otherwise easily declared manipulated due to their fantasy nature.
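The snippet below illustrates the idea with two hypothetical prompt variants (not the paper's exact wording); the detailed one explicitly tells the model to ignore the fictional card design and judge only the printed text.

# Illustrative prompt variants of increasing detail; not the paper's exact wording.
PROMPT_BASIC = "Is the text in this image manipulated? Answer 'yes' or 'no'."

PROMPT_DETAILED = (
    "You are a forensic analyst. Inspect the text regions of this image for "
    "signs of digital tampering: inconsistent fonts, spacing, alignment, color, "
    "blur, or compression artifacts around characters. The card design itself "
    "may be fictional; judge only whether the printed text was digitally "
    "altered. Answer 'yes' or 'no'."
)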

Image resolution: Higher image resolution helps to detect smaller changes. This holds for both the VLMs and the non-VLM method.
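As a sketch of how the two resolution settings can be realized: for GPT-4o, the per-image detail field of the Chat Completions API toggles between the low- and high-resolution vision modes; for open-source VLMs, the analogue is simply feeding a larger input image. The helper names and size cap below are illustrative, not the paper's configuration.

from PIL import Image

# GPT-4o: the per-image "detail" field switches between the low- and
# high-resolution vision modes (b64 is a base64-encoded image string,
# as in the earlier sketch).
def image_block(b64: str, detail: str = "high") -> dict:
    return {
        "type": "image_url",
        "image_url": {"url": f"data:image/png;base64,{b64}", "detail": detail},
    }

# Open-source VLMs: the analogue is to avoid aggressive downscaling before
# the model's preprocessor, e.g. by capping the longer side (value illustrative).
def resize_longer_side(img: Image.Image, max_side: int = 1536) -> Image.Image:
    scale = max_side / max(img.size)
    if scale < 1.0:
        img = img.resize((round(img.width * scale), round(img.height * scale)))
    return img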

Benchmarking

We benchmark on the OSTF and FantasyID datasets with the following image manipulation detection methods: TruFor, FakeShield, SIDA, Qwen-2.5-VL-72B-Instruct, Llama-3.2-90B-Vision, and GPT-4o (gpt-4o-2024-11-20). The low and high variants report performance when the input image resolution is set to low vs. high.
Model Performance on OSTF and FantasyID: F1 scores for the pristine (P) and manipulated (M) classes.

                      |        OSTF           |      FantasyID
Model                 | F1(P)  F1(M)  Avg F1  | F1(P)  F1(M)  Avg F1
----------------------+-----------------------+----------------------
TruFor-low            |  0.72   0.17   0.45   |  0.64   0.20   0.42
TruFor-high           |  0.74   0.56   0.65   |  0.70   0.72   0.71
FakeShield            |  0.51   0.51   0.51   |  0.51   0.51   0.51
SIDA                  |  0.70   0.24   0.47   |  0.67   0.01   0.34
Llama-3.2-90B-Vision  |  0.75   0.52   0.64   |  0.65   0.27   0.46
Qwen-2.5-VL-72B       |  0.85   0.74   0.79   |  0.72   0.40   0.56
GPT-4o-low            |  0.87   0.82   0.84   |  0.74   0.50   0.62
GPT-4o-high           |  0.86   0.85   0.86   |  0.84   0.86   0.85

GPT-4o is markedly better than the open-source models, among which Qwen-2.5-VL performs best. Specialized VLMs like FakeShield and SIDA fail to generalize to the new task of text manipulation detection. The non-VLM baseline TruFor remains competitive, especially when the image resolution is high.

BibTeX


@misc{vidit2025detectingtextmanipulationimages,
  title={Detecting Text Manipulation in Images using Vision Language Models},
  author={Vidit Vidit and Pavel Korshunov and Amir Mohammadi and Christophe Ecabert and Ketan Kotwal and Sébastien Marcel},
  year={2025},
  eprint={2509.10278},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.10278},
}
FantasyID Dataset

@misc{korshunov2025fantasyiddatasetdetectingdigital,
  title={FantasyID: A dataset for detecting digital manipulations of ID-documents},
  author={Pavel Korshunov and Amir Mohammadi and Vidit Vidit and Christophe Ecabert and Sébastien Marcel},
  year={2025},
  eprint={2507.20808},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.20808},
}