
FLEDGE has been renamed to Protected Audience API. To learn more about the name change, see the blog post

FLEDGE Key/Value server deployment on AWS

This article is for adtech engineers who will set up the cloud infrastructure and run the Key/Value server for FLEDGE usage.

To learn more about FLEDGE and the Key/Value server, take a look at the following documents:

For the initial testing of the Key/Value server, you must have or create an Amazon Web Services (AWS) account. You'll need API access, as well as your key ID and secret key.

Set up your AWS account

Set up the AWS CLI

Install the AWS CLI and set up the AWS CLI environment variables. The access key and secret key environment variables must be exported for the server deployment process to work.

export AWS_ACCESS_KEY_ID=[[YOUR_ACCESS_KEY]]
export AWS_SECRET_ACCESS_KEY=[[YOUR_SECRET_KEY]]
export AWS_DEFAULT_REGION=[[YOUR_REGION]]

You can also add the environment variables to your shell startup script (such as .bashrc) to set them on load.

Set up an S3 bucket for Terraform states

Terraform state data is stored in S3, and the bucket must be created manually first. Create a bucket from the AWS console or CLI and note the bucket name.

Gotcha: The bucket name must be globally unique
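If you prefer the CLI, a minimal sketch of creating the bucket (the bucket name is a placeholder you choose):

# Create the Terraform state bucket; bucket name and region are placeholders.
aws s3 mb s3://[[BUCKET_NAME]] --region ${AWS_DEFAULT_REGION}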

Then make sure that the bucket is accessible to the account running Terraform by adding the following to the bucket policy. The account ID can be found using the console, the AWS CLI, or the API:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::[[ACCOUNT_ID]]:root"
            },
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::[[BUCKET_NAME]]"
        },
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::[[ACCOUNT_ID]]:root"
            },
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": "arn:aws:s3:::[[BUCKET_NAME]]/*"
        }
    ]
}

In the AWS console, visit the Permissions tab of the S3 bucket. Click the "Edit" button for "Bucket policy", and copy-paste the above policy with your information filled out.
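Alternatively, this can be done from the command line; a hedged sketch, assuming the policy above is saved locally as bucket_policy.json (a hypothetical filename):

# Look up the account ID to substitute into the policy.
aws sts get-caller-identity --query Account --output text

# Apply the policy to the state bucket; the bucket name is a placeholder.
aws s3api put-bucket-policy --bucket [[BUCKET_NAME]] --policy file://bucket_policy.json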

Build the Key/Value server artifacts

OS: The current build process can run on Debian. Other operating systems may not be well supported at this time.

Before starting the build process, install Docker. If you run into any Docker access errors, follow the instructions for setting up sudoless Docker.

Get the source code from GitHub

The code for the FLEDGE Key/Value server is released on GitHub.

The main branch is under active development. For a more stable experience, please use the latest release branch.

Build the Amazon Machine Image (AMI)

From the Key/Value server repo folder, execute the following command:

prod_mode (default mode)

production/packaging/aws/build_and_test --with-ami us-east-1 --with-ami us-west-1

nonprod_mode

production/packaging/aws/build_and_test --with-ami us-east-1 --with-ami us-west-1 --mode nonprod

The script will build the Enclave Image File (EIF), store it in an AMI, and upload the AMI. If the build is successful, you will see output similar to the following:

==> Builds finished. The artifacts of successful builds are:
--> amazon-ebs.dataserver: AMIs were created:
us-east-1: ami-0fc7e6b563291d9c6

Take note of the AMI ID from the output, as it will be used for Terraform later.

(Optional) Generate and upload a UDF delta file

We provide a default UDF implementation that is loaded into the server at startup.

To use your own UDF, refer to the UDF Delta file documentation to generate a UDF delta file.

Upload this UDF delta file to the S3 bucket that will be used for delta files before attempting to start the server.
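As an illustrative sketch, the upload can be done with the AWS CLI; the delta file name and bucket name below are placeholders:

# Copy a locally generated UDF delta file into the delta file bucket.
aws s3 cp [[UDF_DELTA_FILE]] s3://[[DELTA_FILE_BUCKET]]/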

Deployment

Push artifacts

Set and export the AWS_ECR environment variable to your AWS Elastic Container Registry address, such as:

export AWS_ECR=123456789.dkr.ecr.us-east-1.amazonaws.com
export AWS_REGION=us-east-1  # For example.

The URL for your default private registry is https://aws_account_id.dkr.ecr.region.amazonaws.com

Then run dist/aws/push_sqs to push the SQS cleanup lambda image to AWS ECR.
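If the push fails with an authentication error, you may need to log Docker in to your ECR registry first. A minimal sketch, assuming the AWS_ECR and AWS_REGION variables exported above:

# Authenticate Docker against the private ECR registry.
aws ecr get-login-password --region ${AWS_REGION} | docker login --username AWS --password-stdin ${AWS_ECR}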

Set up Terraform

The setup scripts require Terraform version 1.2.3. There is a helper script /tools/terraform, which uses the official v1.2.3 Terraform docker image. Alternatively, you can download Terraform version 1.2.3 directly, or you can use Terraform Version Manager (tfenv) instead. If you use tfenv, run the following in your <repository_root> to set Terraform to version 1.2.3:

tfenv install 1.2.3;
tfenv use 1.2.3

Update Terraform configuration

For your Terraform configuration, you can use the template under production/terraform/aws/environments/demo. Copy the entire folder to another folder named after your environment (such as dev, staging, or prod), name the files inside according to the region you want to deploy to, and update the file contents as follows.

Update [[REGION]].tfvars.json with the Terraform variables for your environment. Each variable is described in the AWS Terraform vars doc.

Update [[REGION]].backend.conf (a sketch follows the list below):

  • bucket - Set the bucket name that Terraform will use. The bucket was created in the previous Set up an S3 bucket for Terraform states step.
  • key - Set the filename that Terraform will use.
  • region - Set the region where Terraform will run. This should be the same as the region in the variables defined.
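A minimal sketch of what a backend config might contain, written here via a shell heredoc for a hypothetical dev environment in us-east-1 (the bucket name and key are placeholders):

# Write an example backend config; adjust the path, bucket, key, and region for your setup.
cat > dev/us-east-1.backend.conf <<'EOF'
bucket = "[[BUCKET_NAME]]"
key    = "kv-server/dev/us-east-1.tfstate"
region = "us-east-1"
EOF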

Apply Terraform

From your repository/production/terraform/aws/environments folder, run:

ENVIRONMENT=[[YOUR_ENVIRONMENT_NAME]]
REGION=[[YOUR_AWS_REGION]]

Initialize the working directory containing Terraform configuration files:

terraform init --backend-config=./${ENVIRONMENT}/${REGION}.backend.conf --var-file=./${ENVIRONMENT}/${REGION}.tfvars.json --reconfigure

Generate/update AWS resources:

terraform apply --var-file=./${ENVIRONMENT}/${REGION}.tfvars.json

Once the operation completes, you can find the server URL in the kv_server_url value of the output.
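You can also print the URL again later without re-applying; a small sketch:

# Show the server URL from the Terraform outputs of the current configuration.
terraform output kv_server_url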

Confirm resource generation

Once you have executed Terraform, the server URL will be available at the end of the console log (as the Terraform output). To confirm, query "https://demo.kv-server.your-domain.example/v1/getvalues?kv_internal=hi" through your browser or curl, and you should see a "Hello World" quote returned. See the Query the server section for more information.

Note: When the instance is created for the first time, the server may not be able to start up properly due to cloud dependency initialization. Wait ten minutes or so and then query again.

At least the following AWS resources should have been generated:

  • EC2
    • Visit the EC2 console and confirm that new instances have been generated.
    • There should be at least 2 instances, depending on the autoscaling capacity you have specified.
      • There is an SSH instance that you will use to SSH into the Key/Value server instances.
      • There are one or more Key/Value server instances that run the actual server code.
    • Confirm that "Instance state" is showing "Running", and "Status check" shows "2/2 checks passed".
  • S3
    • Visit the S3 console and confirm that a new bucket has been created.
    • In the bucket detail page, check the "Event notification" section under the "Properties" tab. The bucket should be associated with an SNS topic.
  • SNS/SQS
    • Visit the SNS console and confirm that the topic we saw associated with the bucket exists, and that it has an SQS subscribed to it.

Setting up routing

For the DSP Key/Value server, the server's origin must match the interest group owner's origin. For the SSP Key/Value server, the server's origin must match the seller's origin in the auction config.

For example, the interest group owner's origin may be https://dsp.example where https://dsp.example/scripts/bid.js serves the bidding script and https://dsp.example/bidding-signals serves the response from the DSP Key/Value server. For SSP, the seller's origin may be https://ssp.example where https://ssp.example/scripts/ad-scoring.js serves the ad scoring script, and https://ssp.example/scoring-signals serves the response from the SSP Key/Value server.

Since each infrastructure architecture is different (for example, your web serving setup may be using a static S3 bucket, a lambda, or an EC2 server), the actual implementation steps are out of scope for this documentation.

Loading data into the server

Refer to the FLEDGE Key/Value data loading guide documentation for loading data to be queried into the server.

Common operations

Query the server

When you run terraform apply, the output will include the server URL as the kv_server_url value. You can also get the server URL by visiting "Route 53 / Hosted zones / Records". The hosted zone will be named with the value of module/root_domain in the Terraform config. The URL of your environment will be in the format https://[[ENVIRONMENT]].[[ROOT_DOMAIN]], and the GET path is /v1/getvalues.

Once you have constructed your URL, you can use curl to query the server:

KV_SERVER_URL="https://demo.kv-server.your-domain.example"
curl ${KV_SERVER_URL}/v1/getvalues?keys=foo1

Since version 7.47.0, curl sends requests over HTTP/2 by default (see curl-http2). The Terraform setup has the KV load balancer listen for HTTP/2 on port 8443 and HTTP/1.1 on port 443. To query the server over HTTP/1.1:

KV_SERVER_URL="https://demo.kv-server.your-domain.example"
curl ${KV_SERVER_URL}/v1/getvalues?keys=foo1 --http1.1

To test the UDF functionality, query the V2 endpoint (HTTP or gRPC).

BODY='{ "metadata": { "hostname": "example.com" }, "partitions": [{ "id": 0, "compressionGroupId": 0, "arguments": [{ "tags": [ "custom", "keys" ], "data": [ "foo1" ] }] }] }'

HTTP:

Currently, the HTTP/2 endpoint (port 8443) for V2 does not work. We are working on a fix for the next release. Please use the HTTP/1.1 endpoint (port 443) instead.

curl -vX PUT -d "$BODY"  ${KV_SERVER_URL}/v2/getvalues

Or gRPC (using grpcurl):

grpcurl --protoset dist/query_api_descriptor_set.pb -d '{"raw_body": {"data": "'"$(echo -n $BODY|base64 -w 0)"'"}}' demo.kv-server.your-domain.example:8443 kv_server.v2.KeyValueService/GetValuesHttp

SSH into EC2

[Diagram: how a single SSH instance is used to log into multiple server instances]

Step 1: SSH into the SSH EC2 instance

The SSH instance is a dedicated EC2 instance that operators SSH into from the public internet. Access to this instance is controlled by an IAM group named kv-server-[[ENVIRONMENT]]-ssh-users, created as part of applying Terraform. Membership is managed through the AWS console; make sure your IAM user is a member before proceeding. You will need either the instance ID (if connecting with the EC2 Instance Connect CLI) or the public DNS name (if connecting with your own key and SSH client) of the SSH instance; both can be retrieved from the EC2 dashboard.

[Screenshot: where to find the instance ID or public DNS for the EC2 instance]

Confirm that you can SSH into your SSH EC2 instance by following the instructions in Connect using EC2 Instance Connect. For example, to connect to the SSH instance using the EC2 Instance Connect CLI from a Linux machine, install the CLI using the following command and restart your terminal.

Note: the following command assumes you've created a python3 virtualenv named ec2cli, though you can choose to install this tool a different way.

ec2cli/bin/pip3 install ec2instanceconnectcli

Then, log in to the SSH instance using the instance ID and specifying the [[REGION]]:

ec2cli/bin/mssh i-0b427bcab8fe23afb --region us-east-1

If you are having trouble connecting to your EC2 instance, look through the AWS SSH connection article. To perform advanced operations such as copying files, follow the instructions in the article to set up a connection using your own keys.

Step 2: SSH into the actual EC2 instance from the SSH instance

Once you have logged into the SSH instance, log in to the desired server instance by following the instructions in Connect using EC2 Instance Connect again. For example, to connect to the desired server instance using the EC2 Instance Connect CLI, run the following (note that mssh is already pre-installed on the SSH instance, and the desired server instance ID in this example is i-00f54fe22aa47367f):

mssh i-00f54fe22aa47367f --region us-east-1

Once you have connected to the instance, run ls to see the contents of the server. The output should look similar to this:

[ec2-user@ip-10-0-174-130 ~]$ ls
proxy  server_enclave_image.eif  vsockproxy.service
[ec2-user@ip-10-0-174-130 ~]$

Check the Key/Value server

The server EC2 instance is set up to start the Key/Value server automatically, so when you SSH into your server instance for the first time, the server should already be running. Verify that the server is running by executing:

nitro-cli describe-enclaves

You should see an output similar to the following:

[
    {
        "EnclaveName": "server_enclave_image",
        "EnclaveID": "i-02f630b0378c28341-enc18212a6cae2dae4",
        "ProcessID": 5379,
        "EnclaveCID": 16,
        "NumberOfCPUs": 2,
        "CPUIDs": [1, 3],
        "MemoryMiB": 4096,
        "State": "RUNNING",
        "Flags": "DEBUG_MODE"
        // ... and more
    }
]

Read the server log

Most recent server (nonprod_mode) console logs can be read by executing the following command:

ENCLAVE_ID=$(nitro-cli describe-enclaves | jq -r ".[0].EnclaveID"); [ "$ENCLAVE_ID" != "null" ] && nitro-cli console --enclave-id ${ENCLAVE_ID}

If the enable_otel_logger parameter is set to true, the KV server also exports server logs to CloudWatch via the OTel collector, under the CloudWatch log group kv-server-log-group. More details about logging in prod mode and nonprod mode are available in the developing the server guide.
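If you prefer the command line to the CloudWatch console, a hedged sketch of tailing that log group (requires AWS CLI v2; the exact log group name follows the description above):

# Tail the exported server logs from CloudWatch.
aws logs tail kv-server-log-group --follow --region ${AWS_DEFAULT_REGION}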

Start the server

If you have shut down your server for any reason, you can start the Key/Value server by executing the following command:

nitro-cli run-enclave --cpu-count 2 --memory 3072 --eif-path /home/ec2-user/server_enclave_image.eif --debug-mode --enclave-cid 16

Terminate the server

Terminate the Key/Value server by executing the following command:

ENCLAVE_ID=$(nitro-cli describe-enclaves | jq -r ".[0].EnclaveID"); [ "$ENCLAVE_ID" != "null" ] && sudo nitro-cli terminate-enclave --enclave-id ${ENCLAVE_ID}

Updating the server

When new server code is released, pull down the latest code and re-execute the steps from the previous Build the Amazon Machine Image section. Execute the following command:

production/packaging/aws/build_and_test --with-ami us-east-1 --with-ami us-west-1

Then set the new AMI ID in the Terraform config and re-apply Terraform to deploy the updated server. Note that after Terraform completes, you might still need to wait until all EC2 instances are running the new AMI. You can find your EC2 instances in the console, since all of them have your environment name in their names, and check each instance's AMI ID there. You can also manually terminate EC2 instances that still use the old AMI ID, if you choose.
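A hedged CLI alternative for checking which AMI each instance is running (the tag filter assumes instance names contain your environment name, as noted above):

# List instance IDs, AMI IDs, and states for instances whose Name tag contains the environment name.
aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=*[[ENVIRONMENT]]*" \
  --query "Reservations[].Instances[].[InstanceId,ImageId,State.Name]" \
  --output table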

For development on non-production instances, a faster approach is available in the developer guide.

Running the server outside the TEE

For debugging purposes, it is possible to run the server outside of the TEE in a docker container. The docker image is included in the AMI and located under /home/ec2-user/server_docker_image.tar.

There are several options to do so:

1. Using terraform

This will create a new instance and start running the server in a Docker container. If you have running instances in the autoscaling group, they will be terminated and replaced with servers running outside the TEE.

  1. Update your [[REGION]].tfvars.json by setting "run_server_outside_tee": true
  2. Follow the deployment steps if setting up for the first time, or just apply Terraform to update an existing configuration.

To inspect container logs:

  1. Get the id for the docker container:

    docker container ls --filter ancestor=bazel/production/packaging/aws/data_server:server_docker_image
  2. Use the docker logs command.
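For step 2, a brief sketch that combines both steps, assuming a single matching container:

    # Follow the logs of the server container found by the filter above.
    docker logs -f $(docker container ls -q --filter ancestor=bazel/production/packaging/aws/data_server:server_docker_image)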

2. SSH into instance & run Docker

Alternatively, you can SSH into an existing server instance and start the Docker container manually.

  1. Load the docker image

    docker load -i /home/ec2-user/server_docker_image.tar
  2. Make sure to stop any existing servers, inside or outside the TEE.

  3. Stop the proxy

    sudo systemctl stop vsockproxy.service
  4. Run the docker container

    docker run -d --init --rm --network host --security-opt=seccomp=unconfined  \
    --entrypoint=/init_server_basic bazel/production/packaging/aws/data_server:server_docker_image -- --port 50051 --v=5

Viewing Telemetry

Metrics

Metrics are exported to both CloudWatch and Prometheus, both of which are services hosted by Amazon (see otel_collector_config.yaml).

Cloudwatch

Metrics in CloudWatch can be viewed in the AWS CloudWatch console.

Prometheus

Amazon managed Prometheus is configured as part of Terraform. Querying Prometheus can be done using PromQL from the command line. For example:

docker run --rm -it okigan/awscurl --access_key $AWS_ACCESS_KEY_ID  --secret_key $AWS_SECRET_ACCESS_KEY  --region us-east-1 --service aps $AMP_QUERY_ENDPOINT?query=EventStatus

More complex queries can be run using POST:

docker run --rm -it okigan/awscurl --access_key $AWS_ACCESS_KEY_ID  --secret_key $AWS_SECRET_ACCESS_KEY  --region us-east-1 --service aps $AMP_QUERY_ENDPOINT -X POST  -H "Content-Type: application/x-www-form-urlencoded" --data 'query=Latency_bucket{event="ReceivedLowLatencyNotifications"}'

The AMP_QUERY_ENDPOINT can be found in the AWS Prometheus console.
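If you prefer the CLI to the console, a hedged sketch of locating the workspace and its endpoint with the AWS CLI amp commands:

# List Amazon Managed Prometheus workspaces and their IDs.
aws amp list-workspaces --region us-east-1

# Describe a workspace (the ID is a placeholder) to see its endpoint.
aws amp describe-workspace --workspace-id [[WORKSPACE_ID]] --region us-east-1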

More information on PromQL can be found in the Prometheus documentation.

Traces

Traces are exported to AWS X-Ray, which is configured as part of Terraform. Traces are visualized in the AWS X-Ray console.

Frequently asked questions

How to rename resources via Terraform

If you wish to rename resources that have already been generated in a dev environment, you can run terraform destroy to take down the resources and then run terraform apply again.

If your server is already running in production, and you cannot destroy the resources, refer to the state mv command documentation.
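For the in-place approach, a minimal sketch of the command shape; the resource addresses here are hypothetical:

# Move a resource to a new address in the Terraform state without destroying it.
terraform state mv 'aws_s3_bucket.old_name' 'aws_s3_bucket.new_name'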

How to update resource allocation

You may run into a case where the server fails to start due to resource allocation errors, such as "Insufficient CPUs available in the pool. User provided cpu-count is 2, which is more than the configured CPU pool size."

The resources are allocated by specifying the per-TEE values in the terraform variable file, enclave_cpu_count and enclave_memory_mib.

How is private communication configured?

See this doc for more details.