Connect to Apache Spark

As a BigQuery administrator, you can create a connection to enable data analysts to run stored procedures for Apache Spark.

Before you begin

Location considerations

When you choose a location for your data, consider the following:

Multi-regions

You must specify Google Cloud resources located in the same large geographic area:

  • A connection in the BigQuery US multi-region can reference a Spark History Server or a Dataproc Metastore in any single region in the US geographic area, such as us-central1, us-east4, or us-west2.

  • A connection in the BigQuery EU multi-region can reference a Spark History Server or a Dataproc Metastore in member states of the European Union, such as europe-north1 or europe-west3.

Single regions

A connection in a single region can only reference Google Cloud resources in the same region. For example, a connection in the single region us-east4 can only reference a Spark History Server or a Dataproc Metastore in us-east4.

Create connections

Select one of the following options:

Console

  1. Go to the BigQuery page.

    Go to BigQuery

  2. To create a connection, click Add data, and then click Connections to external data sources.

  3. In the Connection type list, select Apache Spark.

  4. In the Connection ID field, enter a name for your connection—for example, spark_connection.

  5. In the Data location list, select a region.

    You can create a connection in regions and multi-regions that support BigQuery. For more information, see Location considerations.

  6. Optional: From the Metastore service list, select a Dataproc Metastore.

  7. Optional: In the History server cluster field, enter a Dataproc Persistent History Server.

  8. Click Create connection.

  9. Click Go to connection.

  10. In the Connection info pane, copy the service account ID for use in a later step.

bq

  1. In a command-line environment, use the bq mk command to create a connection:

    bq mk --connection --connection_type='SPARK' \
     --properties=PROPERTIES \
     --project_id=PROJECT_ID \
     --location=LOCATION \
     CONNECTION_ID
    

    Replace the following:

    • PROPERTIES: a key-value pair to provide connection-specific parameters in JSON format

      For example:

      --properties='{
      "metastoreServiceConfig": {"metastoreService": "METASTORE_SERVICE_NAME"},
      "sparkHistoryServerConfig": {"dataprocCluster": "DATAPROC_CLUSTER_NAME"}
      }'
      

      Replace the following:

      • METASTORE_SERVICE_NAME: the name of your Dataproc Metastore
      • DATAPROC_CLUSTER_NAME: the name of your Dataproc Persistent History Server cluster

    • PROJECT_ID: your Google Cloud project ID

    • LOCATION: the location where you want to store the connection—for example, US

    • CONNECTION_ID: the connection ID—for example, myconnection

      When you view the connection details in the Google Cloud console, the connection ID is the value in the last section of the fully qualified connection ID that is shown in Connection ID—for example, projects/.../locations/.../connections/myconnection.
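
    For example, the following hypothetical invocation creates a connection named my_spark_connection in the US multi-region that references both optional services. Every name is a placeholder, and the full resource-name formats for the metastore service and the history server cluster are assumptions based on the JSON template above:

    bq mk --connection --connection_type='SPARK' \
     --properties='{
     "metastoreServiceConfig": {"metastoreService": "projects/my-project/locations/us-central1/services/my-metastore"},
     "sparkHistoryServerConfig": {"dataprocCluster": "projects/my-project/regions/us-central1/clusters/my-history-server"}
     }' \
     --project_id=my-project \
     --location=US \
     my_spark_connection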

  2. Retrieve and copy the service account ID because you need it in another step:

    bq show --location=LOCATION --connection PROJECT_ID.LOCATION.CONNECTION_ID
    

    The output is similar to the following:

    Connection myproject.us.myconnection

               name             type                          properties
    --------------------------- ------- ---------------------------------------------------
     myproject.us.myconnection   SPARK   {"serviceAccountId": "[email protected]"}
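
    To capture just the service account ID programmatically, you can request JSON output instead. This sketch assumes that the connection's JSON representation nests the ID under a spark key, mirroring the properties column in the sample output, and that jq is installed:

    bq show --format=prettyjson --connection myproject.us.myconnection \
        | jq -r '.spark.serviceAccountId'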
    

For information about how to manage connections, see Manage connections.
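
To quickly confirm that a connection exists, you can also list the connections in a location with the bq tool. In this sketch, my-project and US are placeholder values:

    bq ls --connection --project_id=my-project --location=US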

Grant access to the service account

To let a stored procedure for Apache Spark access your Google Cloud resources, you must grant the necessary IAM permissions to the service account that's associated with the stored procedure's connection. Alternatively, you can use your custom service account for data access. A sketch of the corresponding gcloud grant commands follows the list below.

  • To read and write data from and to BigQuery, you need to give the service account the following IAM permissions:

    • bigquery.tables.* on your BigQuery tables
    • bigquery.readsessions.* on your project

    The roles/bigquery.admin IAM role includes the permissions that the service account needs in order to read and write data from and to BigQuery.

  • To read and write data from and to Cloud Storage, you need to give the service account the storage.objects.* permission on your Cloud Storage objects.

    The roles/storage.objectAdmin IAM role includes the permissions that the service account needs in order to read and write data from and to Cloud Storage.

  • If you specify Dataproc Metastore when you create a connection, then for BigQuery to retrieve details about the metastore configuration, you need to give the service account the metastore.services.get permission on your Dataproc Metastore.

    The predefined roles/metastore.metadataViewer role includes the permission that the service account needs in order to retrieve details about the metastore configuration.

    You also need to grant the service account the roles/storage.objectAdmin role on the Cloud Storage bucket so that your stored procedure can access the Hive warehouse directory of your Dataproc Metastore (hive.metastore.warehouse.dir). If your stored procedure performs operations on the metastore, you might need to give additional permissions. For more information about IAM roles and permissions in Dataproc Metastore, see Dataproc Metastore predefined roles and permissions.

  • If you specify a Dataproc Persistent History Server when you create a connection, then you need to grant the service account the following roles:

    • The roles/dataproc.viewer role, which contains the dataproc.clusters.get permission, on your Dataproc Persistent History Server.
    • The roles/storage.objectAdmin role on the Cloud Storage bucket that you specify for the property spark:spark.history.fs.logDirectory when you create the Dataproc Persistent History Server.

    For more information, see Dataproc Persistent History Server and Dataproc roles and permissions.
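
The following sketch shows one way to grant several of these roles with the gcloud CLI. All resource names are placeholders, and the service account ID is the one that you copied from the connection details, shown here in the format of the earlier sample output:

    # Placeholder service account ID from the connection details.
    CONNECTION_SA="[email protected]"

    # Read and write BigQuery data (project-level grant).
    gcloud projects add-iam-policy-binding my-project \
        --member="serviceAccount:${CONNECTION_SA}" \
        --role="roles/bigquery.admin"

    # Read and write objects in a specific Cloud Storage bucket.
    gcloud storage buckets add-iam-policy-binding gs://my-bucket \
        --member="serviceAccount:${CONNECTION_SA}" \
        --role="roles/storage.objectAdmin"

    # Only if the connection specifies a Dataproc Metastore.
    gcloud metastore services add-iam-policy-binding my-metastore \
        --location=us-central1 \
        --member="serviceAccount:${CONNECTION_SA}" \
        --role="roles/metastore.metadataViewer"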

Share connections with users

You can grant the following roles to let users query data and manage connections:

  • roles/bigquery.connectionUser: enables users to use connections to connect with external data sources and run queries on them.

  • roles/bigquery.connectionAdmin: enables users to manage connections.

For more information about IAM roles and permissions in BigQuery, see Predefined roles and permissions.

Select one of the following options:

Console

  1. Go to the BigQuery page.

    Go to BigQuery

    Connections are listed in your project, in a group called External connections.

  2. In the Explorer pane, click your project name > External connections > connection.

  3. In the Details pane, click Share to share a connection. Then do the following:

    1. In the Connection permissions dialog, share the connection with other principals by adding or editing principals.

    2. Click Save.

bq

You cannot share a connection by using the bq command-line tool. To share a connection, use the Google Cloud console or the BigQuery Connections API method.

API

Use the projects.locations.connections.setIamPolicy method in the BigQuery Connections REST API reference section, and supply an instance of the Policy resource.
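
For illustration, here's a minimal curl sketch of that call. The project, location, connection, and group names are placeholders, and the request body wraps the bindings in a policy object, as the setIamPolicy pattern requires:

    curl -X POST \
        -H "Authorization: Bearer $(gcloud auth print-access-token)" \
        -H "Content-Type: application/json" \
        -d '{
              "policy": {
                "bindings": [
                  {
                    "role": "roles/bigquery.connectionUser",
                    "members": ["group:[email protected]"]
                  }
                ]
              }
            }' \
        "https://bigqueryconnection.googleapis.com/v1/projects/my-project/locations/us/connections/myconnection:setIamPolicy"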

Java

Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.

import com.google.api.resourcenames.ResourceName;
import com.google.cloud.bigquery.connection.v1.ConnectionName;
import com.google.cloud.bigqueryconnection.v1.ConnectionServiceClient;
import com.google.iam.v1.Binding;
import com.google.iam.v1.Policy;
import com.google.iam.v1.SetIamPolicyRequest;
import java.io.IOException;

// Sample to share connections
public class ShareConnection {

  public static void main(String[] args) throws IOException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "MY_PROJECT_ID";
    String location = "MY_LOCATION";
    String connectionId = "MY_CONNECTION_ID";
    shareConnection(projectId, location, connectionId);
  }

  static void shareConnection(String projectId, String location, String connectionId)
      throws IOException {
    try (ConnectionServiceClient client = ConnectionServiceClient.create()) {
      ResourceName resource = ConnectionName.of(projectId, location, connectionId);
      Binding binding =
          Binding.newBuilder()
              .addMembers("group:[email protected]")
              .setRole("roles/bigquery.connectionUser")
              .build();
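      // Note: setIamPolicy replaces the connection's entire IAM policy. To
      // preserve existing bindings, read them with getIamPolicy first and add
      // this binding to the returned policy before calling setIamPolicy.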
      Policy policy = Policy.newBuilder().addBindings(binding).build();
      SetIamPolicyRequest request =
          SetIamPolicyRequest.newBuilder()
              .setResource(resource.toString())
              .setPolicy(policy)
              .build();
      client.setIamPolicy(request);
      System.out.println("Connection shared successfully");
    }
  }
}

What's next