PaliGemma

PaliGemma is a lightweight open vision-language model (VLM) inspired by PaLI-3 and built on open components such as the SigLIP vision model and the Gemma language model. PaliGemma takes both images and text as input and can answer questions about images with detail and context. This makes it well suited to deeper image analysis tasks such as captioning images and short videos, detecting objects, and reading text embedded within images.
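
As a concrete illustration, here is a minimal inference sketch assuming the Hugging Face Transformers integration (PaliGemmaForConditionalGeneration and PaliGemmaProcessor). The checkpoint name, image path, and prompt are illustrative choices, not the only supported options.

```python
# A minimal inference sketch using the Hugging Face Transformers integration.
# The checkpoint name, image path, and prompt are illustrative placeholders.
from PIL import Image
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor

model_id = "google/paligemma-3b-mix-224"  # a mixture-of-tasks checkpoint
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = PaliGemmaProcessor.from_pretrained(model_id)

# Any RGB image works; "cat.jpg" is a placeholder for a local file.
image = Image.open("cat.jpg")

# PaliGemma is steered with short task prefixes, e.g. "caption en" for
# English captioning or "detect <object>" for object detection.
inputs = processor(text="caption en", images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=30)

# Strip the prompt tokens and decode only the newly generated text.
generated = output[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(generated, skip_special_tokens=True))
```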

There are two sets of PaliGemma models, a general-purpose set and a research-oriented set:

  • PaliGemma - General-purpose pretrained models that can be fine-tuned on a variety of tasks.
  • PaliGemma-FT - Research-oriented models that are fine-tuned on specific research datasets.

Key benefits include:

  • Simultaneously understands both images and text.
  • Can be fine-tuned on a wide range of vision-language tasks (see the sketch after this list).
  • Comes with a checkpoint fine-tuned on a mixture of tasks for immediate research use.
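
For the fine-tuning benefit above, the following is a condensed, hypothetical sketch using the Hugging Face Transformers Trainer (the official example linked below uses JAX instead). The dataset fields ("image", "prompt", "label"), the train_dataset variable, and the hyperparameters are assumptions for illustration, not a prescribed recipe.

```python
# A condensed fine-tuning sketch with Hugging Face Transformers' Trainer.
# Dataset fields, the train_dataset variable, and hyperparameters are
# illustrative assumptions, not a prescribed recipe.
from transformers import (
    PaliGemmaForConditionalGeneration,
    PaliGemmaProcessor,
    Trainer,
    TrainingArguments,
)

model_id = "google/paligemma-3b-pt-224"  # a general-purpose pretrained checkpoint
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = PaliGemmaProcessor.from_pretrained(model_id)

def collate(examples):
    # Each example is assumed to carry an "image", a task "prompt", and a
    # target "label" string; the suffix argument turns labels into training
    # targets in the returned batch.
    return processor(
        text=[ex["prompt"] for ex in examples],
        images=[ex["image"] for ex in examples],
        suffix=[ex["label"] for ex in examples],
        return_tensors="pt",
        padding=True,
    )

args = TrainingArguments(
    output_dir="paligemma-finetuned",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    learning_rate=1e-5,
    remove_unused_columns=False,  # keep raw columns for the custom collator
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # hypothetical: your prepared dataset
    data_collator=collate,
)
trainer.train()
```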

Learn more

  • PaliGemma's model card contains detailed information about the model, its implementation, evaluation results, usage, limitations, and more.
  • View more code, Colab notebooks, information, and discussions about PaliGemma on Kaggle.
  • Run a working example of fine-tuning PaliGemma with JAX in Colab.