Multimodal Prompt Engineering for Cross-Task Vision-Language Transfer

Thoren Malrick
Ysella Corbette

Abstract

Large-scale vision-language models (VLMs) have demonstrated impressive zero-shot capabilities across tasks such as image captioning, visual question answering, and referring expression comprehension. However, their cross-task generalization remains limited, especially when moving between heterogeneous tasks with mismatched modalities or annotation formats. To address this, we propose a unified multimodal prompt engineering framework that reformulates diverse vision-language tasks within a shared prompt space. Our method, called PROMPT-X, systematically encodes task instructions, modality cues, and context embeddings into learnable prompt templates that can be applied across multiple VLMs without retraining the backbone. By constructing a joint prompt-conditioned representation space, PROMPT-X enables effective cross-task transfer and adaptation. We evaluate the framework on four challenging benchmarks (COCO Captions, VQAv2, RefCOCO+, and GQA) and demonstrate that it significantly improves both in-domain and transfer performance. Visualizations in Figure 1 illustrate how PROMPT-X aligns task-agnostic prompts with modality-specific semantics, while Table 1 presents performance across prompt types and target tasks. Our findings suggest that prompt engineering, when elevated to a multimodal level, offers a scalable path toward general-purpose vision-language intelligence.
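The abstract describes PROMPT-X as encoding task instructions, modality cues, and context embeddings into learnable prompt templates that are applied to frozen VLM backbones. The PyTorch sketch below illustrates one plausible way such a prompt module could be structured; it is not the authors' implementation, and every name in it (MultimodalPrompt, task_id, modality, num_context_tokens) is an illustrative assumption.

# Hypothetical sketch of a learnable multimodal prompt module (PyTorch).
# Names and shapes are assumptions, not taken from the paper; the backbone
# is assumed to be a frozen CLIP-style VLM that consumes token embeddings.
import torch
import torch.nn as nn


class MultimodalPrompt(nn.Module):
    """Learnable prompt tokens combining a per-task instruction embedding,
    a modality cue, and shared task-agnostic context tokens. The frozen
    backbone only sees the concatenated prompt + input token sequence."""

    def __init__(self, num_tasks: int, embed_dim: int, num_context_tokens: int = 8):
        super().__init__()
        # One learnable instruction vector per task (e.g. captioning, VQA, grounding).
        self.task_instruction = nn.Embedding(num_tasks, embed_dim)
        # Modality cues: 0 = text stream, 1 = vision stream.
        self.modality_cue = nn.Embedding(2, embed_dim)
        # Context tokens shared across all tasks and modalities.
        self.context = nn.Parameter(torch.randn(num_context_tokens, embed_dim) * 0.02)

    def forward(self, input_tokens: torch.Tensor, task_id: int, modality: int) -> torch.Tensor:
        """Prepend prompt tokens to a (batch, seq_len, embed_dim) token sequence."""
        batch = input_tokens.size(0)
        task = self.task_instruction.weight[task_id].expand(batch, 1, -1)
        cue = self.modality_cue.weight[modality].expand(batch, 1, -1)
        ctx = self.context.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([task, cue, ctx, input_tokens], dim=1)


# Usage sketch: only the prompt parameters would be optimized; the VLM stays frozen.
prompt = MultimodalPrompt(num_tasks=4, embed_dim=512)
text_tokens = torch.randn(2, 16, 512)   # placeholder for backbone token embeddings
prompted = prompt(text_tokens, task_id=1, modality=0)
print(prompted.shape)                   # torch.Size([2, 26, 512])

Under this reading, training only the prompt parameters while leaving the backbone untouched is consistent with the abstract's claim that the templates can be applied across multiple VLMs without retraining.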

Article Details

How to Cite
Malrick, T., & Corbette, Y. (2025). Multimodal Prompt Engineering for Cross-Task Vision-Language Transfer. Journal of Computer Science and Software Applications, 5(5). https://doi.org/10.5281/zenodo.15381935