[ML Story] Part 3: Deploy Gemma on Android

Nitin Tiwari
Google Developer Experts
5 min read · Apr 9, 2024


Written in collaboration with AI/ML GDE Aashi Dutt.

Introduction

In the preceding two articles, we learned how to prepare a custom dataset and fine-tune the Gemma model using supervised fine-tuning with LoRA.

Just as a journey needs companions, an ML project needs its model deployed. Without deployment, it's like peanut butter without jelly: incomplete.

In the final part of this series, we’ll walk you through deploying the fine-tuned Gemma model on Android, enabling you to have a complete end-to-end application at your fingertips. Before we begin, ensure that you have Android Studio installed on your computer and an Android phone with good GPU capabilities.

Pipeline: Model Deployment

Why on-device deployment?
Deploying LLMs on mobile devices keeps user data on the device and makes the model available offline, with no internet connection required.

LLMs, unlike other ML models, tend to be considerably large in size because of the sheer number of parameters they contain. Consequently, deploying them on mobile devices poses challenges due to several factors:

(a) The large size of the model makes it impractical to bundle an LLM with the APK.

(b) Running inference with LLMs on mobile devices demands significant computational power and GPU capabilities.

However, there's no need for concern: the MediaPipe LLM Inference Task takes care of all the heavy lifting for us. Let's first understand what it does.

MediaPipe LLM Inference API

The LLM Inference API enables you to run large language models directly on your device and is capable of performing a wide range of tasks such as text generation, question-answering, document summarization, etc.
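
To get a feel for the API before we convert our model, below is a minimal Kotlin sketch of how an Android app typically loads a model and generates text with the LLM Inference Task. It assumes the com.google.mediapipe:tasks-genai Gradle dependency; the model path and parameter values are illustrative, and we'll prepare our own model in the steps that follow.

// Minimal sketch: load an on-device model and run one synchronous generation.
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

fun askLlm(context: Context, prompt: String): String {
    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath("/data/local/tmp/llm/model.bin")  // model file on the device
        .setMaxTokens(512)                              // input + output token limit
        .build()
    val llm = LlmInference.createFromOptions(context, options)
    return llm.generateResponse(prompt)
}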

Below are some of the open models supported by the LLM Inference Task:

  • Gemma 2B
  • Phi-2
  • Falcon-RW-1B
  • StableLM-3B

Our fine-tuned SciGemma model is derived from the Gemma 2B-it model, so it must first be converted into a format compatible with the LLM Inference Task.

Step 1: Model Conversion

Let’s break down the model conversion process.

Model Conversion using MediaPipe Gen AI Converter

In Part 2, the original Gemma 2B-it model was fine-tuned on a Q&A dataset sourced from science textbooks; the resulting checkpoint is saved in the safetensors format.

The fine-tuned model is then converted into a MediaPipe-compatible format using the Gen AI converter included in the MediaPipe PyPI package.

Let’s begin by installing the required dependencies and libraries.

# Install MediaPipe and PyTorch.
!pip install mediapipe
!pip install torch

# Import MP GenAI converter.
from mediapipe.tasks.python.genai import converter

Next, we’ll set up the conversion parameters, including the input model path, model format, type, vocabulary path, etc. You may need to adjust these parameter values based on your specific use case.

Refer to the official blog for a detailed description of each parameter.

# Configure conversion parameters.
config = converter.ConversionConfig(
    input_ckpt='/content/fine_tuned_science_gemma2b-it/',
    ckpt_format='safetensors',
    model_type='GEMMA_2B',
    backend='gpu',
    output_dir='/content/intermediate/fine_tuned_science_gemma2b-it/',
    combine_file_only=False,
    vocab_model_file='/content/fine_tuned_science_gemma2b-it/',
    output_tflite_file='/content/fine_tuned_science_gemma2b-it/scigemma.bin',
)

# Start model conversion.
converter.convert_checkpoint(config)

print("Model converted successfully.")

Following the conversion, the resulting model is about 2.52 GB in size, approximately 5 times smaller than the original Gemma 2B-it model. Now, let’s upload this model to Hugging Face for easy access whenever required.

from huggingface_hub import create_repo, upload_folder, whoami

# Output directory containing the converted model.
output_dir = "fine_tuned_science_gemma2b-it"

# Assumes you're already authenticated (e.g. via `huggingface-cli login`),
# so the cached token is picked up automatically.
username = whoami()["name"]
repo_id = f"{username}/{output_dir}"

# Create the repository if it doesn't already exist.
repo_id = create_repo(repo_id, exist_ok=True).repo_id

# Upload the converted model folder.
upload_folder(
    repo_id=repo_id,
    folder_path=output_dir,
    commit_message="Fine-tuned model pushed.",
    ignore_patterns=["step_*", "epoch_*"],
)

The converted model is now available in the Hugging Face repository. Download the scigemma.bin model file to your computer.

Step 2: Push the model to Android

The converted model is still large, about 2.5 GB, so it cannot be bundled directly into an APK. However, there's an alternative way to deploy the model on-device without including it in the APK.

Android Debug Bridge (ADB) is a command-line tool that lets your computer communicate with an Android device. Connect your Android phone (with USB debugging enabled) to your computer and open a command prompt or terminal.

# Initialize the ADB shell.
adb shell

# Remove any previously loaded model.
rm -r /data/local/tmp/llm/

exit

# Push the model to your Android phone.
adb push D:/Projects/SciGemma/scigemma.bin /data/local/tmp/llm/scigemma.bin

The push command creates a new directory on your phone (/data/local/tmp/llm/) from which the model will be loaded.
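
Because the model lives outside the APK, it's good practice for the app to verify at runtime that the pushed file actually exists before initializing the inference task. Here's a small illustrative Kotlin check (the path matches the adb push above; the helper name is ours):

import java.io.File

// Fail fast with a clear message if the model hasn't been pushed to the device.
private const val MODEL_PATH = "/data/local/tmp/llm/scigemma.bin"

fun ensureModelExists() {
    check(File(MODEL_PATH).exists()) {
        "Model not found at $MODEL_PATH. Push it with adb before launching the app."
    }
}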

Step 3: Build the APK

We’re on to the final step of this project: building the APK. A ready-to-use template app, inspired by the official MediaPipe example, has already been created; you can find it in this GitHub repository.

Clone the repository to your local machine and build it in Android Studio. Once the app builds successfully, open the InferenceModel.kt file and edit Line 44, replacing YOUR_MODEL_NAME.bin with the actual name of your model file. In our case, the model’s name is scigemma.bin.

// Replace YOUR_MODEL_NAME with the name of your model file.
private const val MODEL_PATH = "/data/local/tmp/llm/YOUR_MODEL_NAME.bin"
Configure the model path
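
For a chat-style UI, you'll usually want tokens to stream in as they are generated rather than waiting for the full answer. The template app relies on the task's asynchronous API for this; the sketch below shows the general pattern (the function and callback names are ours, and the template's actual wiring may differ slightly):

import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Sketch: register a result listener so partial results stream to the UI,
// then kick off asynchronous generation.
fun startStreaming(context: Context, prompt: String, onPartial: (String, Boolean) -> Unit) {
    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath("/data/local/tmp/llm/scigemma.bin")
        .setMaxTokens(1024)
        .setResultListener { partial, done -> onPartial(partial, done) }  // done marks the final chunk
        .build()
    val llm = LlmInference.createFromOptions(context, options)
    llm.generateResponseAsync(prompt)
}

In a real app, you'd create the LlmInference instance once and reuse it across prompts, since loading the model is expensive.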

Now, rebuild the app and install the APK on your phone.

Final Android App — SciGemma

SciGemma is also available as a quick try-out on Hugging Face 🤗 Spaces.

SciGemma on HF Spaces

Watch the complete demo video on YouTube here.

SciGemma — Demo Video

Congratulations on building and deploying your own custom fine-tuned model on an Android device!

Putting it together

End-to-end pipeline

Throughout this series, you’ve gained valuable insights into preparing a custom dataset, fine-tuning the Gemma model using LoRA, converting the fine-tuned model to a MediaPipe-compatible format, and ultimately deploying the converted model on Android.

The complete fine-tuning code and the Android application can be found in this GitHub repository: Gemma on Android. If you missed the previous two articles, you can find them here:

Part 1: Prepare your custom dataset
Part 2: Fine-tune Gemma 2b-it model

We hope you found this end-to-end project enlightening and enjoyable. Please consider starring the repository and sharing it with others.

If you have any queries, feel free to connect with Aashi or me on LinkedIn.

Acknowledgment

This project was developed during Google’s ML Developer Programs Gemma sprint. We thank the MLDP team for the opportunity.
