Shutterstock's Content Datasets Now on Databricks Marketplace

Enhance Machine Learning Models with Robust Content Datasets from Shutterstock
In today's data-driven world, the fusion of visual assets and analytical capabilities unlocks a realm of untapped potential. Image datasets are crucial in developing and training Generative AI (GenAI) technologies. We are thrilled to announce a groundbreaking collaboration that brings the vast collection of Shutterstock imagery to the Databricks Marketplace — our first listing of Volume (aka non-tabular) datasets on our Marketplace. This free sample dataset, which consists of 1,000 images and accompanying metadata sourced from Shutterstock's 550+ million image library, is available for immediate access. This blog will explore Shutterstock's image library on Databricks Marketplace and the industry use cases.

Why Databricks Marketplace?

Traditional data marketplaces are restrictive: they offer only tabular data or simple applications, which limits their value to data collaborators, and they don't provide tools to evaluate datasets. Databricks Marketplace is an open marketplace that enables you to share and exchange data assets such as tabular datasets, volumes, notebooks, and AI models across clouds, regions, and platforms. Since launching in June, Databricks Marketplace has grown to over 1,800 listings from over 180 providers.

Shutterstock on Databricks Marketplace

"Shutterstock is bringing its vast collection of nearly a billion creative content assets to the Databricks Marketplace, a platform renowned for fostering open data and AI collaboration," said Aimee Egan, Chief Enterprise Officer, Shutterstock. "This integration provides unparalleled access to our extensive library of ethically-sourced visual content, propelling responsible AI and ML initiatives forward across various industries. We are excited to add Delta Sharing as a method to deliver data. Customers utilizing our rich dataset on Databricks can tap into new opportunities, catalyze product innovations, and secure a competitive advantage."

Shutterstock's datasets incorporate full metadata, including keywords, descriptions, geo-locations, and categories, which makes organizing and searching for images easier. The datasets span a wide range of industry categories, such as food and beverage, transportation and autonomous vehicles, animals and wildlife, clothing and apparel, and travel, tourism and hospitality [1]. Shutterstock's image library plays a pivotal role in GenAI, serving as a foundational resource for training advanced AI models, including multimodal models like OpenAI's DALL-E.

Watch the demo below to learn more about Shutterstock's listing, how to access it and query it using a notebook.

Unlocking New Possibilities and Use Cases

With Shutterstock's listing on the marketplace, here are common use cases across industries that drive innovation:

  • Media & Entertainment: Every day, users create millions of photographs. Media organizations can utilize machine learning models, enhanced by Shutterstock's vast library, to automatically interpret the content within these images. This capability enables them to refine their customer data for more effective ad targeting and increased engagement.
  • Retail: Apparel retailers want to generate personalized "try before you buy" images that show how a new outfit appears on a person resembling the customer. Shutterstock's extensive library gives retailers the confidence to dynamically create accurate images without the risk of licensing issues.
  • AI Startups: Companies at the forefront of specialized machine learning require clean, ethically sourced datasets to build the models that form the foundation of their business. Responsible AI has become essential to scaling a successful AI startup, with investors directing companies to avoid high-profile lawsuits.

Shutterstock Uses Volume Sharing for Seamless Collaboration

Volumes are a type of object in Unity Catalog that simplifies the integration of non-tabular data as a collection of directories and files that you can access, store and manage in your governance framework.
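As a rough illustration of how files in a Volume are addressed, the sketch below builds a path in the /Volumes/<catalog>/<schema>/<volume>/ form used by Unity Catalog and lists the image files under a directory. The catalog, schema, and volume names are hypothetical placeholders, and a temporary directory stands in for a mounted Volume so the example runs outside Databricks.

```python
import tempfile
from pathlib import Path, PurePosixPath

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> PurePosixPath:
    """Build a /Volumes/<catalog>/<schema>/<volume>/... path."""
    return PurePosixPath("/Volumes", catalog, schema, volume, *parts)


def list_images(root: Path) -> list[str]:
    """Return image files under root as sorted paths relative to root."""
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


# Hypothetical names: substitute the catalog chosen when the listing was installed.
print(volume_path("shutterstock_sample", "default", "images", "food", "0001.jpg"))

# A temporary directory stands in for the mounted Volume.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "food").mkdir()
    (root / "food" / "0001.jpg").touch()   # image file, kept
    (root / "food" / "0001.json").touch()  # metadata sidecar, skipped
    print(list_images(root))
```

On Databricks, the same listing function could be pointed at the real Volume directory instead of the temporary one, since Volume files are exposed through ordinary file paths.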

As we recently announced, you can now share Volumes through Delta Sharing, which is available in Public Preview. With Volume Sharing, you can securely share extensive collections of non-tabular data such as PDFs, images, videos, audio files and other documents, along with tables, notebooks and AI models, across clouds, regions and accounts.

This free sample dataset from Shutterstock represents the first Volume-based listing offered on the Databricks Marketplace. With access to Shutterstock's diverse collection of images and accompanying metadata, you can use Volume Sharing to incorporate this dataset into Generative AI applications using a Retrieval Augmented Generation (RAG) technique without copying the data.
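To make the RAG idea more concrete, here is a minimal sketch of the retrieval step over image metadata. It assumes each image has a description and keywords; the record fields and sample values are invented for illustration and are not the actual Shutterstock schema. It also uses simple bag-of-words cosine similarity in place of the embedding model and vector index a production system would typically rely on.

```python
import math
import re
from collections import Counter

# Made-up records for illustration only.
RECORDS = [
    {"path": "food/0001.jpg",
     "description": "Bowl of fresh strawberries on a wooden table",
     "keywords": ["fruit", "strawberry", "food"]},
    {"path": "transport/0002.jpg",
     "description": "City bus at a crosswalk",
     "keywords": ["bus", "traffic", "street"]},
    {"path": "food/0003.jpg",
     "description": "Sliced bread on a cutting board",
     "keywords": ["bread", "bakery", "food"]},
]


def tokenize(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, records: list[dict], k: int = 2) -> list[dict]:
    """Return up to k records whose metadata best matches the query."""
    q = tokenize(query)
    scored = []
    for rec in records:
        text = rec["description"] + " " + " ".join(rec["keywords"])
        score = cosine(q, tokenize(text))
        if score > 0:
            scored.append((score, rec))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [rec for _, rec in scored[:k]]


def build_prompt(question: str, hits: list[dict]) -> str:
    """Place the retrieved metadata ahead of the question as context."""
    context = "\n".join(f"- {h['path']}: {h['description']}" for h in hits)
    return f"Context images:\n{context}\n\nQuestion: {question}"


hits = retrieve("fresh strawberries in a bowl", RECORDS, k=1)
print(build_prompt("Which image fits a fruit product page?", hits))
```

The retrieved records are passed to the generative model as context. The images themselves stay in the shared Volume and are referenced by path.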

Volume Sharing helps accelerate collaboration between business units or partners and helps onboard new collaborators across clouds, platforms, and regions. Data providers on Databricks Marketplace, such as Shutterstock, can now easily share any non-tabular data with consumers. This approach democratizes data access and significantly reduces the time and resources required to obtain and utilize high-quality datasets.

How does it all come together?

Let's walk through an example of a fictitious retailer, Berkeley FoodMart, that wants to improve the product descriptions on its website. Well-optimized product listings are more likely to appear prominently in search engine results, attracting potential customers and increasing organic traffic. Additionally, optimized titles and descriptions compel users to click on the listings, resulting in higher click-through rates and more visitors exploring products.

The challenge? Like other grocers, Berkeley FoodMart carries 50,000 products in its store with 20% turnover each year, which translates into hundreds of thousands or even millions of items needing appropriate descriptions. It's cost-prohibitive to maintain descriptions for all products manually. Given these costs, existing descriptions are often limited in breadth.

Berkeley FoodMart will use Shutterstock's diverse image datasets, retrieved from Databricks Marketplace, to help automate this work. To generate product metadata and descriptions for its website, Berkeley FoodMart will combine Shutterstock's immense library of images, including brand and product data, with its own internal images to build image-to-text analytics.

  1. First, Berkeley FoodMart will work with the Shutterstock team to identify how much and what data they need. Shutterstock can help customize the images they distribute based on volume and metadata search criteria. Shutterstock also distributes other data products, including video and audio data.
  2. Once the datasets are procured through Databricks Marketplace, Shutterstock datasets are shared with Berkeley FoodMart.
  3. The metadata of the Volumes shared with Berkeley FoodMart is available in Databricks Unity Catalog, mounted under the catalog name specified by Berkeley FoodMart.
  4. Berkeley FoodMart will use the Shutterstock dataset and its robust metadata to build an image-to-text model that generates metadata and keywords from new product images. Shutterstock image datasets are fully curated, so Berkeley FoodMart can safely build its model with clear data origins. The retailer will then use these keywords with an LLM to generate user-friendly product descriptions. Databricks fine-tuning makes this easy by letting Berkeley FoodMart start with its preferred LLM and train it further on new datasets.
  5. Berkeley FoodMart will use Databricks Model Serving to deploy the fine-tuned model to a system where future images can be easily and automatically processed.
  6. The metadata and descriptions will be manually reviewed in the beginning, but over time the system will learn and enable more and more automation. This makes it possible to produce rich product descriptions at massive scale, ensuring Berkeley FoodMart users can find products easily.
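The last three steps can be sketched roughly as follows. This is only an illustration under stated assumptions: the Prediction class, its confidence score, the prompt wording, and the threshold values are all hypothetical, and the image-to-text model and LLM calls are left out so the example stays self-contained.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Hypothetical output of the image-to-text model for one product image."""
    image_path: str
    keywords: list[str]
    confidence: float  # assumed model score in [0, 1]


def description_prompt(pred: Prediction, max_keywords: int = 5) -> str:
    """Turn predicted keywords into a prompt for a description-writing LLM."""
    chosen = ", ".join(pred.keywords[:max_keywords])
    return (
        "Write a short, user-friendly product description "
        f"for an item with these keywords: {chosen}."
    )


def route(preds: list[Prediction], threshold: float):
    """Split predictions into auto-publish and manual-review lists."""
    auto, review = [], []
    for p in preds:
        (auto if p.confidence >= threshold else review).append(p)
    return auto, review


preds = [
    Prediction("products/1001.jpg", ["organic", "strawberry", "jam"], 0.93),
    Prediction("products/1002.jpg", ["bottle", "sauce"], 0.58),
]

# Early on, a threshold above 1.0 sends every item to manual review.
auto, review = route(preds, threshold=1.01)
print(len(auto), len(review))

# As trust in the model grows, a lower threshold automates more items.
auto, review = route(preds, threshold=0.8)
for p in auto:
    print(p.image_path, "->", description_prompt(p))
```

Lowering the threshold over time is one simple way to model the shift from manual review toward automation described in step 6.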

Getting Started with Shutterstock on Databricks Marketplace

The future of AI and data-driven innovation is bright, and with tools like these at our disposal, there's no limit to what we can achieve together. Let's embark on this exciting journey and transform the landscape of technology and creativity.

Sources

  1. Shutterstock Data Licensing and the Contributor Fund