How to build a recommendation system on e-commerce data using BigQuery ML

Published in

Google Cloud - Community

10 min readJul 13, 2020

With your data in BigQuery, machine learning workflows are now easier than ever with BigQuery ML. In this post you’ll learn how to build a product recommendation system in a retail scenario using matrix factorization, and how to use the predicted recommendations to drive marketing activation.

Why are recommendation systems so important?

Photo by rupixen: https://unsplash.com/photos/Q59HmzK38eQ

The majority of consumers today expect personalization — to see products and services relevant to their interests. Naturally, they can help businesses too. By learning from user behaviours and preferences, businesses can deliver their recommendations in a variety of ways, including personalized coupons, marketing emails, and search results, or targeted ads. Ultimately, this enables businesses to attract more customer spending with targeted cross-selling or upselling, while reducing unnecessary costs by marketing irrelevant products.

“Companies that fail to show customers they know them and their buying preferences risk losing business to competitors who are more attuned to what their customers want.”
- Harvard Business Review. “The Age of Personalization”. September 2018

How do recommendation systems work?

Based on user preferences, matrix factorization (collaborative filtering) is one of the most common and effective methods of creating recommendation systems. For more information about how they work, see this introduction to recommendation systems here.

https://developers.google.com/machine-learning/recommendation/collaborative/matrix

What is BigQuery ML?

BigQuery ML enables users to create and execute machine learning models in BigQuery by using standard SQL queries. This means, if your data is already in BigQuery, you don’t need to export your data to train and deploy machine learning models — by training, you’re also deploying in the same step. Combined with BigQuery’s auto-scaling of compute resources, you won’t have to worry about spinning up a cluster or building a model training and deployment pipeline. This means you’ll be saving time building your machine learning pipeline, enabling your business to focus more on the value of machine learning instead of spending time setting up the infrastructure.

You may have also heard of Recommendations AI, a Google Cloud product purpose-built for real-time recommendations on a website using state-of-the-art deep learning models. Matrix factorization with BigQuery ML, on the other hand, is a more generic ML algorithm that can be used for offline and online recommendations (e.g. personalized e-mail campaigns).

Training & deploying a matrix factorization model on e-commerce data using BigQuery ML

Let’s go through an example of how to identify what other products you would recommend to a user, based on how long they’ve viewed a product on an online store. You can also view the Jupyter notebook version of this blogpost here.

The data

The Google Analytics Sample dataset, which is hosted publicly on BigQuery, is a dataset that provides 12 months (August 2016 to August 2017) of obfuscated Google Analytics 360 data from the Google Merchandise Store, a real e-commerce store that sells Google-branded merchandise.

Screenshot from the Google Merchandise Store.

Here’s a sample of some of the raw data from Google Analytics:

Pre-process the data

There are two types of matrix factorization based on the kind of user feedback: explicit and implicit. With explicit feedback, the dataset must indicate a user’s preference for a product, like a rating between 1 and 5 stars.

Unless a large number of users provide reviews of products, there is no or insufficient explicit feedback on how much a user liked a product. In such cases, other behavioral metrics need to be used to infer their implicit preference. One way to infer interest in a product is to consider the total time a user spends on a product detail page (e.g., session duration).

What should the training data look like?

With matrix factorization, in order to train the model, you will need a table with userId, itemId, and the rating. In this example, session duration on a product page will be used as the implicit rating. If you have other metrics (e.g., frequency of pageviews), you can simply combine the metrics together using a weighted sum to compute a rating value.

Example of cleaned training data for matrix factorization,
where session duration (in milliseconds) on a product page is used as the implicit rating

Using the Google Analytics Sample, we can create a table containing the pre-processed training data:

## follows schema from https://support.google.com/analytics/answer/3437719?hl=en&ref_topic=3416089CREATE OR REPLACE TABLE bqml.aggregate_web_stats AS (
  WITH
    durations AS (
      --calculate pageview durations
      SELECT
        CONCAT(fullVisitorID,'-', 
             CAST(visitNumber AS STRING),'-', 
             CAST(hitNumber AS STRING) ) AS visitorId_session_hit,
        LEAD(time, 1) OVER (
          PARTITION BY CONCAT(fullVisitorID,'-',CAST(visitNumber AS STRING))
          ORDER BY
          time ASC ) - time AS pageview_duration
      FROM
        `bigquery-public-data.google_analytics_sample.ga_sessions_2016*`,
        UNNEST(hits) AS hit 
    ),
      
    prodview_durations AS (
      --filter for product detail pages only
      SELECT
        CONCAT(fullVisitorID,'-',CAST(visitNumber AS STRING)) AS visitorId,
        productSKU AS itemId,
        IFNULL(dur.pageview_duration,
          1) AS pageview_duration,
      FROM
        `bigquery-public-data.google_analytics_sample.ga_sessions_2016*` t,
        UNNEST(hits) AS hits,
        UNNEST(hits.product) AS hits_product
      JOIN
        durations dur
      ON
        CONCAT(fullVisitorID,'-',
               CAST(visitNumber AS STRING),'-',
               CAST(hitNumber AS STRING)) = dur.visitorId_session_hit
      WHERE
      #action_type: Product detail views = 2
      eCommerceAction.action_type = "2" 
    ),
    
    aggregate_web_stats AS(
      --sum pageview durations by visitorId, itemId
      SELECT
        visitorId,
        itemId,
        SUM(pageview_duration) AS session_duration
      FROM
        prodview_durations
      GROUP BY
        visitorId,
        itemId )
    SELECT
      *
    FROM
      aggregate_web_stats
);-- Show table
SELECT
  *
FROM
  bqml.aggregate_web_stats
LIMIT
  10

Now we’re ready to train the matrix factorization model.

If you’d prefer to see this blogpost in a Jupyter notebook format, you can view it directly on Github:

https://github.com/GoogleCloudPlatform/analytics-componentized-patterns/blob/master/retail/recommendation-system/bqml/bqml_retail_recommendation_system.ipynb

Training the Matrix Factorization model in BigQuery ML

To train the matrix factorization model (with implicit feedback), you will need to set the options:

model_type: ‘matrix_factorization’
user_col: <user column name>
item_col: <item column name>
rating_col: <rating column name>
feedback_type: ‘implicit’ (default is ‘explicit’)

To learn more about the parameters when training a model, read the documentation on the CREATE MODEL statement for Matrix Factorization.

CREATE OR REPLACE MODEL bqml.retail_recommender
OPTIONS(model_type='matrix_factorization', 
        user_col='visitorId', 
        item_col='itemId',
        rating_col='session_duration',
        feedback_type='implicit'
        )
AS
SELECT * FROM bqml.aggregate_web_stats

Note: You may need to setup slot reservations. For more information, you can read up on how to set up flex slots programmatically or via the BigQuery UI.

Model Evaluation

After completing training, our model is now deployed (as bqml.retail_commender). You can inspect the resulting metrics from model evaluation.

For more information on these metrics, read the ML.EVALUATE documentation here.

SELECT
  *
FROM
  ML.EVALUATE(MODEL bqml.retail_recommender)

Hyperparameter Tuning

If you want to improve your model, some of the hyperparameters you can tune are:

NUM_FACTORS: Specifies the number of latent factors to use for matrix factorization models (int64_value)
L2_REG: The amount of L2 regularization applied (float64_value)
WALS_ALPHA: A hyperparameter for ‘IMPLICIT’ matrix factorization model (float64_value)

Check out this blogpost for more information on hyperparameter tuning in BigQuery ML.

Making Predictions using ML.RECOMMEND

Let’s do a quick check on what the resulting recommendations look like. What are the top 5 items you could recommend to a specific visitorId? Note that I’ve used a bit of SQL scripting to help us check for a specific visitor.

DECLARE MY_VISITORID STRING DEFAULT "0824461277962362623-1";SELECT
  *
FROM
  ML.RECOMMEND(MODEL `bqml.retail_recommender`,
      (SELECT MY_VISITORID as visitorID)
              )
ORDER BY predicted_session_duration_confidence DESC
LIMIT 5

Ordered descendingly by the confidence score, these show the top 5 items for this particular visitorId.

Let’s add in the product names by joining with productSKU so we can see what the recommended products actually are:

DECLARE
  MY_VISITORID STRING DEFAULT "6499749315992064304-2";
WITH
  product_details AS(
  SELECT
    productSKU,
    v2ProductName,
  FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_2016*`,
    UNNEST(hits) AS hits,
    UNNEST(hits.product) AS hits_product
  GROUP BY
    2,
    1 )
SELECT
  r.*,
  d.v2ProductName
FROM
  ML.RECOMMEND(MODEL `bqml.retail_recommender`,
    (
    SELECT
      MY_VISITORID AS visitorId)) r
JOIN
  product_details d
ON
  r.itemId = d.productSKU
ORDER BY
  predicted_session_duration_confidence DESC
LIMIT
  5

Product recommendations for this visitor

What about for all visitors? Since the resulting output can be a bit large, I’ve outputted the results to a table:

-- Create output table of top 5 predictions
CREATE OR REPLACE TABLE bqml.prod_recommendations AS (
WITH predictions AS (
    SELECT 
      visitorId, 
      ARRAY_AGG(STRUCT(itemId, 
                       predicted_session_duration_confidence)
                ORDER BY 
                  predicted_session_duration_confidence DESC
                LIMIT 5) as recommended
    FROM ML.RECOMMEND(MODEL bqml.retail_recommender)
    GROUP BY visitorId
)SELECT
  visitorId,
  itemId,
  predicted_session_duration_confidence
FROM
  predictions p,
  UNNEST(recommended)
);-- Show table
SELECT
  *
FROM
  bqml.prod_recommendations
ORDER BY 
  visitorId
LIMIT
  10

Using the predicted recommendations in production

Once you have the recommendations, plugging into your production pipeline will depend on your use case.

Here are a few possible ways to help you get started:

A. Export recommendations for marketing activation

For activation via Google Ads, Display & Video 360 and Search Ads 360
For activation via emails

B. Other ways to export recommendations from BigQuery

BigQuery to pandas dataframe
Export the predictions to Google Cloud Storage

By exporting the resulting predictions from BigQuery ML back to Google Analytics, you will be able to generate custom remarketing audiences and target customers more effectively with ads, search, or email activation.

A-1. Export recommendations to Google Analytics 360 (Google Marketing Platform)

Formatting the data for Google Analytics 360

You may need to format the data output into something that Google Analytics can directly use, for example:

Here’s a sample query for an itemId GGOEYOLR018699, that normalizes the confidence scores between 0 and 1, using ML.MIN_MAX_SCALER:

WITH predictions AS (
    SELECT 
      visitorId, 
      ARRAY_AGG(STRUCT(itemId, 
                       predicted_session_duration_confidence)
                ORDER BY 
                  predicted_session_duration_confidence) as recommended
    FROM ML.RECOMMEND(MODEL bqml.retail_recommender)
    WHERE itemId = "GGOEYOLR018699"
    GROUP BY visitorId
)
  
SELECT
  visitorId,
  ML.MIN_MAX_SCALER(predicted_session_duration_confidence) OVER() as GGOEYOLR018699
FROM
  predictions p,
  UNNEST(recommended)
ORDER BY GGOEYOLR018699 DESC

To create a column per product, you can use the pivot() function as described in this blogpost.

For Google Analytics Data Import, it’s recommended that you use clientId as the key, along with individual columns that show some propensity score. In other words, you may need to create a new column for each product that you are interested in recommending, and create a custom dimension in Google Analytics that can be then used to build your audiences. It’s also likely best to ensure that you have one row per clientId. If you know you will be exporting predictions to Google Analytics, it’s recommended that you train your models using clientId directly instead of visitorId.

Exporting the data from BigQuery into Google Analytics 360

The easiest way to export your BigQuery ML predictions from a BigQuery table to Google Analytics 360 is to use the MoDeM (Model Deployment for Marketing) reference implementation. MoDeM helps you load data into Google Analytics for eventual activation in Google Ads, Display & Video 360 and Search Ads 360

To export to Google Analytics 360 from BigQuery:

Follow the step-by-step instructions here to build your ETL pipeline from BigQuery ML to Google Analytics using MoDeM. You can also view the interactive instructions in this notebook.

A-2. Email activation using Salesforce Marketing Cloud

As Google Analytics does not contain email addresses, you may need to integrate with a 3rd-party platform like Salesforce Marketing Cloud for email activations.

Google Analytics 360 customers can activate their Analytics 360 audiences in Marketing Cloud on Salesforce direct marketing channels (email and SMS). This enables your marketing team to build audiences based on online web behavior and engage with those customers via emails and SMS.

Follow the step-by-step instructions here to integrate Google Analytics 360 with Salesforce Marketing Cloud, or learn more about Audience Activation through Salesforce Trailhead.

B. Other ways to export recommendations from BigQuery

B-1. BigQuery to pandas dataframe

With the predictions stored in a separate table, you can export the data into a Pandas dataframe using the BigQuery Storage API (see documentation and code samples). You can also use other BigQuery client libraries.

B-2. Export predictions table to Google Cloud Storage

There are several ways to export the predictions table to Google Cloud Storage, so that you can use them in a separate service. This documentation shows you, amongst other ways, how to export using the bq extract command.

Summary

Congratulations! You now know how to train your recommendation system using BigQuery ML, evaluate your model, tune hyperparameters and deploy your model, and use the results in production. Your business is now one step closer to being more personalized for your customers.

Want more?

Please leave me your comments with any suggestions or corrections.

I’m Polong Lin, a Developer Advocate for Google Cloud. Follow me on @polonglin, and all about BigQuery on reddit.com/r/bigquery.

Thanks to reviewers: Abhishek Kashyap, Lak Lakshmanan, Tai Conley, Rebecca Gutteridge, Oly Bhaumik, Marc Cohen, Tahir Fayyaz, Felipe Hoffa.