Tpu training on gke (#18)
* Initial checkin of the TPU training on GKE reference guide

* Updates to the main README

* Created reusable Terraform modules

* Modified Terraform scripts

* Updated Terraform modules

* Updated Terraform modules

* Updated Terraform modules

* Updated the README files

* Increased time-outs in TPU node pool creation

* Updated Terraform

* Updated Terraform

* Updated JobSet and Kueue configuration

* Updated Cloud Build configuration

* Updated the main README

* Updated the setup for examples

* Updated the main README

* Updated Hello World examples

* Updated Hello World examples

* Updated hello world examples

* Updated maxtext examples

* Updated maxtext examples

* Updated the jobset examples

* Updated the jobset examples

* Updates to JobSet examples

* Updates to JobSet examples

* Updated the README

* Updated the README

* Updated the Jobset examples

* wip

* Updated xpk examples

* Updated the Terraform module docs

* Updated the Terraform module readme

* Updated the README for Terraform modules

* Updated Terraform to support v5p

* Cleanup Terraform

* Clean up Kustomize

* Updated the main README

* Updated Terraform

* Updated the README

* Updated JobSet examples

* updated xpk examples
jarokaz committed Dec 25, 2023
1 parent 3041b7b commit 6935754
Showing 89 changed files with 4,550 additions and 0 deletions.
10 changes: 10 additions & 0 deletions .gitignore
@@ -127,3 +127,13 @@ dmypy.json

# Pyre type checker
.pyre/

# Terraform
*.tfvars
*.tfvars.json

.terraform.lock.hcl
**/.terraform/*

*.tfstate
*.tfstate.backup
134 changes: 134 additions & 0 deletions ai-infrastructure/terraform-modules/bootstrap/README.md
@@ -0,0 +1,134 @@
# Automation bootstrap

This Terraform module performs the initial configuration of a GCP project, a step that requires elevated administrative permissions. Its primary purpose is to set up Terraform and Cloud Build automation for subsequent provisioning tasks. The module enables the specified set of services and creates an automation service account and an automation GCS bucket. The module configures an existing project: as written, `main.tf` sets `project_create = false`, so the project itself is not created.

## Examples

```
module "automation_bootstrap" {
  source     = "github.com/GoogleCloudPlatform/applied-ai-engineering-samples//ai-infrastructure/terraform-modules/bootstrap"
  project_id = "project-id"
  automation_bucket = {
    name     = "automation-bucket-name"
    location = "us-central1"
  }
  automation_sa_name = "service-account-name"
  # Additional services to enable, on top of the module defaults
  services = [
    "aiplatform.googleapis.com"
  ]
  # Additional roles to assign to the automation service account
  roles = [
    "roles/aiplatform.user"
  ]
}
```

By default, the module enables the following services:

- accesscontextmanager.googleapis.com
- artifactregistry.googleapis.com
- cloudbuild.googleapis.com
- cloudkms.googleapis.com
- cloudresourcemanager.googleapis.com
- container.googleapis.com
- compute.googleapis.com
- iam.googleapis.com
- iamcredentials.googleapis.com
- serviceusage.googleapis.com
- sourcerepo.googleapis.com
- stackdriver.googleapis.com
- storage-component.googleapis.com
- storage.googleapis.com
- sts.googleapis.com

You can specify additional services to enable through the `services` input variable.

By default, the following roles are assigned to the automation service account:

- roles/iam.securityAdmin
- roles/iam.serviceAccountAdmin
- roles/compute.networkAdmin
- roles/container.admin
- roles/iam.serviceAccountUser
- roles/storage.admin
- roles/artifactregistry.admin

You can specify additional roles to assign to the automation service account through the `roles` input variable.


## Impersonating the automation service account

To use the automation service account, the account that runs Terraform commands in the other deployment stages must have the `roles/iam.serviceAccountTokenCreator` role on the automation service account. You can grant this role with the following command. Set the `AUTOMATION_SERVICE_ACCOUNT` and `TERRAFORM_USER_ACCOUNT` variables to the email addresses of the accounts in your environment.


```
AUTOMATION_SERVICE_ACCOUNT=your-automation-service-account-name@jk-mlops-dev.iam.gserviceaccount.com
TERRAFORM_USER_ACCOUNT="<TERRAFORM_USER_ACCOUNT_EMAIL>"
gcloud iam service-accounts add-iam-policy-binding $AUTOMATION_SERVICE_ACCOUNT --member="user:$TERRAFORM_USER_ACCOUNT" --role='roles/iam.serviceAccountTokenCreator'
```

If the impersonating account is itself a service account, such as the Cloud Build service account, use the `serviceAccount:` member prefix instead of `user:`:


```
AUTOMATION_SERVICE_ACCOUNT=your-automation-service-account-name@jk-mlops-dev.iam.gserviceaccount.com
TERRAFORM_USER_ACCOUNT="<IMPERSONATING_SERVICE_ACCOUNT_EMAIL>"
gcloud iam service-accounts add-iam-policy-binding $AUTOMATION_SERVICE_ACCOUNT --member="serviceAccount:$TERRAFORM_USER_ACCOUNT" --role='roles/iam.serviceAccountTokenCreator'
```
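
If you prefer to manage this binding in Terraform rather than with `gcloud`, the same grant can be sketched with the Google provider's `google_service_account_iam_member` resource. This is a minimal sketch, not part of the module: the project ID, the account emails, and the resource name below are hypothetical placeholders, and the identity that applies it must already be allowed to set IAM policy on the automation service account.

```hcl
# Hypothetical values; replace them with the accounts in your environment.
locals {
  project_id          = "my-project"
  automation_sa_email = "automation-sa@my-project.iam.gserviceaccount.com"
  # Use the "serviceAccount:" prefix if the impersonating identity is a service account.
  terraform_member = "user:terraform-user@example.com"
}

# Allows terraform_member to mint short-lived tokens for the automation service account.
resource "google_service_account_iam_member" "automation_sa_token_creator" {
  service_account_id = "projects/${local.project_id}/serviceAccounts/${local.automation_sa_email}"
  role               = "roles/iam.serviceAccountTokenCreator"
  member             = local.terraform_member
}
```

The `_iam_member` variant is additive, so it leaves any other bindings on the service account untouched.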


## Input variables

| Name | Description | Type | Required | Default |
|---|---|---|---|---|
|[project_id](variables.tf#L31)|The ID of the project in which to enable services and create the automation service account and the automation bucket|`string`|✓||
|[deletion_protection](variables.tf#L15)|Prevent Terraform from destroying the automation bucket. When this field is set, a `terraform destroy` or `terraform apply` that would delete the bucket will fail.|`bool`||`true`|
|[automation_bucket](variables.tf#L22)|Settings for the automation bucket (`name` and `location`)|`object({name = string, location = string})`|✓||
|[automation_sa_name](variables.tf#L37)|The name of the automation service account|`string`|✓||
|[services](variables.tf#L43)|The list of additional services to enable|`list(string)`||`[]`|
|[roles](variables.tf#L50)|The list of additional roles to assign to the automation service account|`list(string)`||`[]`|


## Outputs

| Name | Description |
|---|---|
|[automation_sa](outputs.tf#L42)|The email of the automation service account|
|[automation_gcs](outputs.tf#L37)|The name of the automation bucket|
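
As an illustration of how these outputs might be consumed, the sketch below re-exports them from a root configuration that calls the module. The module label `automation_bootstrap` and the input values follow the example earlier in this README; the output names `automation_sa_email` and `providers_folder` are hypothetical.

```hcl
module "automation_bootstrap" {
  source     = "github.com/GoogleCloudPlatform/applied-ai-engineering-samples//ai-infrastructure/terraform-modules/bootstrap"
  project_id = "project-id"
  automation_bucket = {
    name     = "automation-bucket-name"
    location = "us-central1"
  }
  automation_sa_name = "service-account-name"
}

# Email of the automation service account, for use in impersonation settings.
output "automation_sa_email" {
  value = module.automation_bootstrap.automation_sa
}

# Folder in the automation bucket that holds the generated providers.tf and backend.tf.
output "providers_folder" {
  value = "gs://${module.automation_bootstrap.automation_gcs}/providers"
}
```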



The module also creates two files in the `gs://<AUTOMATION_BUCKET_NAME>/providers` folder:

- the `providers.tf` file

```
provider "google" {
  impersonate_service_account = "<AUTOMATION_SERVICE_ACCOUNT_EMAIL>"
}
provider "google-beta" {
  impersonate_service_account = "<AUTOMATION_SERVICE_ACCOUNT_EMAIL>"
}
```

- the `backend.tf` file

```
terraform {
  backend "gcs" {
    bucket                      = "automation-bucket-name"
    impersonate_service_account = "<AUTOMATION_SERVICE_ACCOUNT_EMAIL>"
    # remove the newline between quotes and set the prefix to the folder for Terraform state
    prefix = "
    "
  }
}
```

You can use these files in the downstream Terraform stages to keep Terraform state in Cloud Storage and to run Terraform with service account impersonation.
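
For example, after copying the generated `backend.tf` into a downstream stage, the edited file might look like the sketch below. The bucket name, the service account email, and the `gke-tpu-environment` prefix are hypothetical placeholders; the prefix is the folder in the bucket under which that stage's state is stored.

```hcl
terraform {
  backend "gcs" {
    bucket                      = "automation-bucket-name"
    impersonate_service_account = "automation-sa@my-project.iam.gserviceaccount.com"
    # The newline between the quotes has been removed and a state folder chosen.
    prefix = "gke-tpu-environment"
  }
}
```

Giving each downstream stage its own prefix keeps the state files separate within the one automation bucket.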





90 changes: 90 additions & 0 deletions ai-infrastructure/terraform-modules/bootstrap/main.tf
@@ -0,0 +1,90 @@
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

locals {
gcs_storage_class = (
length(split("-", var.automation_bucket.location)) < 2
? "MULTI_REGIONAL"
: "REGIONAL"
)

default_services = [
"accesscontextmanager.googleapis.com",
"artifactregistry.googleapis.com",
"cloudbuild.googleapis.com",
"cloudkms.googleapis.com",
"cloudresourcemanager.googleapis.com",
"container.googleapis.com",
"compute.googleapis.com",
"container.googleapis.com",
"iam.googleapis.com",
"iamcredentials.googleapis.com",
"serviceusage.googleapis.com",
"sourcerepo.googleapis.com",
"stackdriver.googleapis.com",
"storage-component.googleapis.com",
"storage.googleapis.com",
"sts.googleapis.com"
]
services = concat(local.default_services, var.services)

default_roles = [
"roles/iam.securityAdmin",
"roles/iam.serviceAccountAdmin",
"roles/compute.networkAdmin",
"roles/container.admin",
"roles/iam.serviceAccountUser",
"roles/storage.admin",
"roles/artifactregistry.admin",
]
roles = concat(local.default_roles, var.roles)
}

module "project_config" {
source = "github.com/GoogleCloudPlatform/cloud-foundation-fabric//modules/project?ref=v28.0.0&depth=1"
name = var.project_id
project_create = false
services = local.services
}

module "automation_gcs" {
source = "github.com/GoogleCloudPlatform/cloud-foundation-fabric//modules/gcs?ref=v28.0.0&depth=1"
project_id = module.project_config.project_id
name = var.automation_bucket.name
location = var.automation_bucket.location
storage_class = local.gcs_storage_class
versioning = true
force_destroy = !var.deletion_protection
}


module "automation_sa" {
source = "github.com/GoogleCloudPlatform/cloud-foundation-fabric//modules/iam-service-account?ref=v28.0.0&depth=1"
project_id = module.project_config.project_id
name = var.automation_sa_name
display_name = "Terraform automation service account."
# allow SA used by CI/CD workflow to impersonate this SA
#iam = {
# "roles/iam.serviceAccountTokenCreator" = compact([
# try(module.automation-tf-cicd-sa["bootstrap"].iam_email, null)
# ])
#}
iam_storage_roles = {
(module.automation_gcs.name) = ["roles/storage.admin"]
}
iam_project_roles = {
(module.project_config.project_id) = local.roles
}
}
52 changes: 52 additions & 0 deletions ai-infrastructure/terraform-modules/bootstrap/outputs.tf
@@ -0,0 +1,52 @@
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


locals {
_tpl_providers = "${path.module}/templates/providers.tf.tpl"
_tpl_backend = "${path.module}/templates/backend.tf.tpl"
providers = {
"providers" = templatefile(local._tpl_providers, {
sa = module.automation_sa.email
})

"backend" = templatefile(local._tpl_backend, {
backend_extra = join("\n", [
"# remove the newline between quotes and set the prefix to the folder for Terraform state",
"prefix = \"",
"\""
])
bucket = module.automation_gcs.name
sa = module.automation_sa.email
})
}
}

output "automation_gcs" {
description = "GCS bucket where Terraform automation artifacts are managed"
value = module.automation_gcs.name
}

output "automation_sa" {
description = "The email of the automation service account"
value = module.automation_sa.email
}

resource "google_storage_bucket_object" "providers" {
for_each = local.providers
bucket = module.automation_gcs.name
name = "providers/${each.key}.tf"
content = each.value
}
24 changes: 24 additions & 0 deletions ai-infrastructure/terraform-modules/bootstrap/templates/backend.tf.tpl
@@ -0,0 +1,24 @@
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


terraform {
backend "gcs" {
bucket = "${bucket}"
impersonate_service_account = "${sa}"
%{~ if backend_extra != null ~}
${indent(4, backend_extra)}
%{~ endif ~}
}
}
22 changes: 22 additions & 0 deletions ai-infrastructure/terraform-modules/bootstrap/templates/providers.tf.tpl
@@ -0,0 +1,22 @@
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


provider "google" {
impersonate_service_account = "${sa}"
}
provider "google-beta" {
impersonate_service_account = "${sa}"
}

55 changes: 55 additions & 0 deletions ai-infrastructure/terraform-modules/bootstrap/variables.tf
@@ -0,0 +1,55 @@
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

variable "deletion_protection" {
description = "Prevent Terraform from destroying data storage resources (storage buckets, GKE clusters). When this field is set, a terraform destroy or terraform apply that would delete data storage resources will fail."
type = bool
default = true
nullable = false
}

variable "automation_bucket" {
description = "The parameters of the bucket to be used by automation tools including Terraform backend"
type = object({
name = string
location = string
})
nullable = false
}

variable "project_id" {
description = "The GCP project ID"
type = string
nullable = false
}

variable "automation_sa_name" {
description = "The name of the automation service account"
type = string
nullable = false
}

variable "services" {
description = "Additional services to enable"
type = list(string)
default = []
nullable = false
}

variable "roles" {
description = "Additional roles to add to an automation account"
type = list(string)
default = []
nullable = false
}