Google has announced PaliGemma 2, the successor to its PaliGemma vision-language model. The new family comes in three sizes (3 billion, 10 billion, and 28 billion parameters) and three input resolutions (224px, 448px, and 896px), so users can pick the combination that fits their task and compute budget.
Google reports improved performance for PaliGemma 2 in several areas, including:
- Chemical formula recognition
- Music score recognition
- Spatial reasoning
- Chest X-ray report generation
- Detailed, contextually relevant image captioning
The models are designed as drop-in replacements for the original PaliGemma, so existing code should need few changes. Pre-trained checkpoints can be downloaded for free from Hugging Face and Kaggle, and PaliGemma 2 works with several frameworks, including Hugging Face Transformers, Keras, PyTorch, JAX, and Gemma.cpp.
According to Google, PaliGemma 2's flexibility simplifies fine-tuning for specific tasks and datasets, letting users tailor the model to their own requirements.
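For readers who want to try the Hugging Face Transformers route, the following is a minimal sketch rather than an official recipe. It assumes the pre-trained checkpoints follow the Hub naming pattern `google/paligemma2-<size>-pt-<resolution>`, that the available sizes are 3B, 10B, and 28B at 224px, 448px, and 896px, and that the `transformers` library is installed with the model license accepted on the Hub. The helper names `checkpoint_id` and `load` are hypothetical and exist only for illustration.

```python
# Sizes and resolutions assumed from Google's release; verify on the Hub.
SIZES = ("3b", "10b", "28b")
RESOLUTIONS = (224, 448, 896)


def checkpoint_id(size: str, resolution: int) -> str:
    """Build a Hub ID for a pre-trained checkpoint (assumed naming pattern)."""
    if size not in SIZES:
        raise ValueError(f"unknown size: {size!r}")
    if resolution not in RESOLUTIONS:
        raise ValueError(f"unknown resolution: {resolution!r}")
    return f"google/paligemma2-{size}-pt-{resolution}"


def load(size: str = "3b", resolution: int = 224):
    """Download and return (processor, model) for the chosen checkpoint."""
    # Imported lazily so the ID helper works without transformers installed.
    from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

    model_id = checkpoint_id(size, resolution)
    processor = AutoProcessor.from_pretrained(model_id)
    model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
    return processor, model
```

If the drop-in claim holds, moving existing PaliGemma code to the new family should mostly be a matter of passing a different checkpoint ID, for example `load("10b", 448)` in this sketch.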