Enhancing Web Image Accessibility for Visually Impaired Individuals with Gemini Pro Vision and Google Cloud Platform

Cyrus Wong
Google Developer Experts
4 min read · Apr 25, 2024


Problem

Visually impaired individuals often cannot access image information because many websites do not follow the W3C Web Accessibility Initiative guidelines. Currently, about 60% of websites lack meaningful alternative text for their images, and it is not feasible to retroactively add descriptive text to every existing website by hand.

A short two-minute project introduction and our story

Demo 1

Demo 2

Solution — GeProVis AI Screen Reader

GeProVis is short for Gemini Pro Vision. My students have significantly enhanced the conventional Google ChromeVox screen reader by incorporating the capabilities of Google Gemini Pro Vision. This blog post focuses on the Google Cloud Platform (GCP) side of the project. In brief, ChromeVox extracts the image source URL and sends it to GCP.

  1. Frontend: The project combines the ChromeVox Classic screen reader, Cloud Functions, and Google Gemini Pro Vision. ChromeVox Classic is favored for its comprehensive functionality and widespread adoption. Cloud Functions is employed for its flexibility, ready-to-use environment, scalability, and cost-effectiveness. Gemini Pro Vision is chosen because it is a powerful AI model that can describe an image in a fraction of a second, is exposed through an API, and is inexpensive to use. The project enhances the open-source ChromeVox Classic screen reader, which runs as an extension to the Google Chrome browser; the extension module is written in JavaScript.
  2. Backend: The backend calls the Google Gemini Pro Vision API from a Google Cloud Function. The API generates descriptive text for images received in either URL or base64 format. The default language for this text is English, although this can be changed by supplying the ‘lang’ parameter. The text generated by the Gemini API is then sent to the Google Translation API for translation into the selected language and returned to the front end.
  3. Cloud Functions: First, the user is identified from the API key. This step is protected by Google Cloud API Gateway, which provides API key authentication and rate limiting. Next, the image data is downloaded and, if its size exceeds 3MB, it is resized. Then gemini-1.0-pro-vision is invoked to obtain a caption of roughly 40 words, based on the locale; requests are spread across all regions where the model is available to maximize the rate limit. The estimated cost of each AI call is then computed. Finally, captions and usage are stored in Google Cloud Datastore. To save cost and speed up responses, captions for an image that has already been seen are translated on demand by Google Translate when the same image is requested in a different language. A sketch of this flow appears after the list below.
  4. Google Cloud Datastore: This contains three kinds: ApiKey, Caption, and Usage.
  • ApiKey: Matches an API key to a User ID.
  • Caption: Maps an image hash to its caption, serving as a cache that skips AI calls for repeated images. To protect privacy, we do not log any URLs.
  • Usage: Records usage for each user, including the cost and time. Each API key has a daily cost limit for budget control.
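To make the flow above concrete, here is a minimal sketch of the caption pipeline as a Cloud Function written in Python. The endpoint shape, prompt wording, entity fields, and hard-coded MIME type are assumptions for illustration, and the resizing and usage-logging steps are omitted; the actual code in the repository may differ.

# Minimal sketch of the caption pipeline; names and fields are illustrative assumptions.
import hashlib

import functions_framework
import requests
import vertexai
from google.cloud import datastore, translate_v2 as translate
from vertexai.generative_models import GenerativeModel, Part

ds = datastore.Client()
translator = translate.Client()

@functions_framework.http
def caption(request):
    url = request.args["url"]
    lang = request.args.get("lang", "en")

    # Download the image server-side (the second approach described below).
    image_bytes = requests.get(url, timeout=10).content

    # Cache on the image hash, not the URL, so no URLs are stored.
    image_hash = hashlib.sha256(image_bytes).hexdigest()
    key = ds.key("Caption", image_hash)
    entity = ds.get(key)

    if entity is None:
        # Ask Gemini Pro Vision for a roughly 40-word description in English.
        vertexai.init(location="us-central1")  # project inferred from the runtime environment
        model = GenerativeModel("gemini-1.0-pro-vision")
        response = model.generate_content([
            Part.from_data(data=image_bytes, mime_type="image/jpeg"),
            "Describe this image for a visually impaired user in about 40 words.",
        ])
        entity = datastore.Entity(key=key)
        entity["en"] = response.text
        ds.put(entity)

    # Translate the cached English caption on demand for other locales.
    if lang != "en" and lang not in entity:
        result = translator.translate(entity["en"], target_language=lang)
        entity[lang] = result["translatedText"]
        ds.put(entity)

    return {"caption": entity[lang] if lang in entity else entity["en"]}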

Current limitation

ChromeVox, behind the scenes, captures URLs and sends this information, along with the browser locale, to the Cloud Function. We have experimented with two different approaches. The first involves ChromeVox downloading the image and sending it to the cloud function, but this has occasionally encountered CORS permission issues. The second approach has ChromeVox sending the URL to the cloud function, which then downloads the image. However, this doesn’t work if the site requires a login to access the image.

After a testing period and gathering feedback from visually impaired users, we have chosen to proceed with the second approach.
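For illustration, the request that reaches the Cloud Function in this second approach looks roughly like the following (shown here with Python's requests library; the real extension sends it from JavaScript, and the endpoint URL, header name, and parameter names are assumptions):

import requests

# Hypothetical API Gateway endpoint and key header; the deployed names may differ.
ENDPOINT = "https://example-gateway.uc.gateway.dev/caption"

resp = requests.get(
    ENDPOINT,
    params={
        "url": "https://example.com/photo.jpg",  # image source URL captured by ChromeVox
        "lang": "zh-TW",                         # browser locale; defaults to English
    },
    headers={"x-api-key": "YOUR_API_KEY"},       # API key checked by API Gateway
    timeout=15,
)
print(resp.json()["caption"])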

We welcome any opinions, suggestions, or pull requests to help resolve these issues, as we recognize that we are not experts in web technology.

GitHub Repo

This project is entirely open-source, and you can easily deploy GCP resources using CDK-TF.
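As a rough idea of what a CDK-TF deployment looks like, here is a minimal stack sketch in Python. The project name, bucket name, and region are placeholders, and the real stack in the repository defines far more resources.

# Minimal CDK for Terraform (cdktf) stack sketch; the resource set is illustrative only.
from constructs import Construct
from cdktf import App, TerraformStack
from cdktf_cdktf_provider_google.provider import GoogleProvider
from cdktf_cdktf_provider_google.storage_bucket import StorageBucket


class GeProVisStack(TerraformStack):
    def __init__(self, scope: Construct, name: str):
        super().__init__(scope, name)
        GoogleProvider(self, "google", project="my-gcp-project", region="us-central1")
        # Bucket to hold the zipped Cloud Function source (placeholder name).
        StorageBucket(self, "source", name="geprovis-function-source", location="US")


app = App()
GeProVisStack(app, "geprovis")
app.synth()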

Conclusion

This project is one of the top 100 candidate projects of the Google Developer Student Clubs — GDSC 2024 Solution Challenge. If you love this project, please give GDSC-HKIIT a vote by liking the project background YouTube clip and sharing our story. The team hopes Google will add this project as a built-in feature of the Chrome browser, or even ChromeOS, in the future!

We hope that developers and NGOs everywhere can adapt and implement this solution for their own cities or countries, because most websites fail to provide appropriate “Alt” text and none of the current screen readers can help visually impaired users comprehend the content of images.

About the Author

Cyrus Wong is a senior lecturer at the Hong Kong Institute of Information Technology and focuses on teaching public cloud technologies. A passionate advocate for cloud tech adoption in media and events — AWS Machine Learning Hero, Microsoft MVP — Azure, and Google Developer Expert — Google Cloud Platform & AI/ML (GenAI).
