GCP Batch unable to fetch Docker image after setting noExternalIpAddress = true

Hi, I followed this tutorial to block external access for Batch jobs, https://cloud.google.com/batch/docs/job-without-external-access, and this one to turn on Private Google Access, https://cloud.google.com/vpc/docs/private-google-access. However, after this my Batch job is no longer able to download its Docker image and prints the error below. Is this expected, and how should I resolve it? Thanks!
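For context, the relevant network part of my job request looks roughly like this (a sketch with placeholder names, not my exact config; `NETWORK` and `SUBNET` are stand-ins):

```json
"allocationPolicy": {
  "network": {
    "networkInterfaces": [
      {
        "network": "projects/PROJECT/global/networks/NETWORK",
        "subnetwork": "projects/PROJECT/regions/us-east1/subnetworks/SUBNET",
        "noExternalIpAddress": true
      }
    ]
  }
}
```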

ERROR 2024-05-11T17:18:58.802714628Z Unable to find image 'us-east1-docker.pkg.dev/PROJECT/DOCKER_NAME:latest' locally
ERROR 2024-05-11T17:18:58.840505637Z docker: Error response from daemon: Head "https://us-east1-docker.pkg.dev/v2/PROJECT/DOCKER_NAME/manifests/latest": denied: Unauthenticated request. Unauthenticated requests do not have permission "artifactregistry.repositories.downloadArtifacts" on resource "projects/PROJECT/locations/us-east1/repositories/ARTIFACTR_REGISTRY_NAME" (or it may not exist).
ERROR 2024-05-11T17:18:58.840538631Z See 'docker run --help'.

Solved
1 ACCEPTED SOLUTION

Hi @gradientopt,

Thanks for the details you sent over direct message!

If your tasks partially fail both with and without an external IP address, I would suggest adding a retry on exit code 125 to unblock yourself while we investigate the flakiness further.
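The retry can be configured in the job's taskSpec. Roughly like this, as a sketch against the Batch v1 REST API (the maxRetryCount value here is arbitrary; exit code 125 is what `docker run` returns when the daemon itself fails, e.g. on an image pull error):

```json
"taskSpec": {
  "maxRetryCount": 3,
  "lifecyclePolicies": [
    {
      "action": "RETRY_TASK",
      "actionCondition": { "exitCodes": [125] }
    }
  ]
}
```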

If, in the no-external-IP case, every job/task fails, I would suggest checking your private network settings. You should be able to reach Artifact Registry if Private Google Access is set up properly.
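To verify that Private Google Access is enabled on the subnet your VMs use, something like the following should work (`SUBNET` and `REGION` are placeholders for your actual subnet and region):

```shell
# Check whether Private Google Access is enabled on the subnet
gcloud compute networks subnets describe SUBNET \
    --region=REGION \
    --format="get(privateIpGoogleAccess)"

# Enable it if the command above prints "False"
gcloud compute networks subnets update SUBNET \
    --region=REGION \
    --enable-private-ip-google-access
```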

Besides, would you mind trying a smaller number of VMs to see whether that helps with the flakiness? By my estimate, each of your jobs uses 5000/4 = 1250 VMs, and 10 jobs in parallel means 12,500 VMs over the same period. Batch fetches the Docker image on each VM, which puts a high request load on Artifact Registry at that time.

Thanks!


6 REPLIES

Hi @gradientopt, from the log info you provided, the Docker image URL is `us-east1-docker.pkg.dev/PROJECT/DOCKER_NAME:latest`, which does not seem to have the proper PROJECT and DOCKER_NAME values. Would you mind double-checking the image URL you provided in the Batch job request, or sharing one of your failed job request JSON files or a Job UID so I can help check? Thanks!

Hi, since this is a public forum, I redacted my project name and Docker image name. I will send you a DM about the issue. Thanks for the help!

Got it! I am wondering what a reasonable number of VMs would be. 1000?

Thanks a lot! I am wondering if there is an alternative to fetching the Docker image on each VM?

Hi @gradientopt,

I hope the earlier solution has unblocked you!

Before I give you a more accurate answer to your follow-up questions, would you mind sharing one example of a failed job (task)'s Batch JSON logs, so that I can better understand the specific root cause in your case?

Feel free to send me more info through private message (or chat with me there) if you are more comfortable with that. Thanks!