Resolving resource limit issues in Cloud Service Mesh

This section explains common Cloud Service Mesh problems and how to resolve them. If you need additional assistance, see Getting support.

Cloud Service Mesh resource limit problems can be caused by any of the following:

  • LimitRange objects created in the istio-system namespace or in any namespace with automatic sidecar injection enabled (see the check that follows this list).
  • User-defined limits that are set too low.
  • Nodes running out of memory or other resources.
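
For example, you can check whether any LimitRange objects exist in the relevant namespaces with kubectl (the namespace and object names below are placeholders):

    # LimitRange objects in the control plane namespace
    kubectl get limitrange -n istio-system

    # LimitRange objects in a namespace that has automatic sidecar injection enabled
    kubectl get limitrange -n NAMESPACE

    # Inspect the default limits a LimitRange applies to containers
    kubectl describe limitrange LIMIT_RANGE_NAME -n NAMESPACE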

Potential symptoms of resource problems:

  • Cloud Service Mesh proxies repeatedly fail to receive configuration from the control plane, indicated by the error Envoy proxy NOT ready. Seeing this error a few times at startup is normal, but repeated occurrences afterward are a concern.
  • Networking problems in which some pods or nodes become unreachable.
  • The output of istioctl proxy-status shows STALE statuses.
  • OOMKilled messages in the logs of nodes.
  • High memory usage by containers, shown by kubectl top pod POD_NAME --containers.
  • High memory usage on a node, shown by kubectl top node NODE_NAME.
  • An Envoy proxy that runs out of memory: kubectl get pods shows a status of OOMKilled in the output.
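
The commands below, with placeholder pod and node names, collect the signals listed above:

    # Proxy sync status; STALE means the sidecar has not received the latest configuration
    istioctl proxy-status

    # Per-container CPU and memory usage for a pod
    kubectl top pod POD_NAME --containers

    # Resource usage for a node
    kubectl top node NODE_NAME

    # Look for pods with an OOMKilled status
    kubectl get pods --all-namespaces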

Sidecars take a long time to receive configuration

Slow configuration propagation can occur due to insufficient resources allocated to istiod or an excessively large cluster size.

There are several possible solutions to this problem:

  1. For in-cluster Cloud Service Mesh, if your monitoring tools (Prometheus, Stackdriver, and so on) show high utilization of a resource by istiod, increase the allocation of that resource; for example, increase the CPU or memory limit of the istiod deployment. This is a temporary solution, and we recommend that you investigate methods for reducing resource consumption.
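
     For example, one way to raise these limits with kubectl is shown below. This is a sketch: it assumes the default in-cluster deployment name istiod and container name discovery in the istio-system namespace (revision-based installs use names such as istiod-REVISION), and the values are illustrative only. If you installed with istioctl or an operator, prefer setting the same values in your installation configuration so they are not reverted.

         # Raise requests and limits on the istiod control plane container
         kubectl -n istio-system set resources deployment istiod \
             -c discovery \
             --requests=cpu=500m,memory=2Gi \
             --limits=cpu=1,memory=4Gi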

  2. If you encounter this issue in a large cluster or deployment, reduce the amount of configuration state pushed to each proxy by configuring Sidecar resources.
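
     As a sketch, a mesh-wide default Sidecar resource in the root namespace (istio-system in a default in-cluster installation) restricts each proxy's configuration to its own namespace plus the control plane; the host list here is an assumption, and it must cover every namespace your workloads actually call:

         apiVersion: networking.istio.io/v1beta1
         kind: Sidecar
         metadata:
           name: default
           namespace: istio-system   # root namespace, so this applies mesh-wide unless overridden
         spec:
           egress:
           - hosts:
             - "./*"                 # services in the workload's own namespace
             - "istio-system/*"      # the control plane namespace

     Save this to a file (for example, sidecar-default.yaml) and apply it with kubectl apply -f sidecar-default.yaml.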

  3. For in-cluster Cloud Service Mesh, if the problem persists, try horizontally scaling istiod.
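
     For example, assuming the default deployment name istiod in the istio-system namespace (revision-based installs use names such as istiod-REVISION), you can add replicas as shown below. If the deployment has a HorizontalPodAutoscaler, adjust its minReplicas instead so that autoscaling does not undo the change.

         # Run multiple istiod replicas to spread configuration pushes across instances
         kubectl -n istio-system scale deployment istiod --replicas=2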

  4. If all other troubleshooting steps fail to resolve the problem, report a bug detailing your deployment and the observed problems. Follow these steps to include a CPU/memory profile in the bug report if possible, along with a detailed description of cluster size, number of pods, and number of services.
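
     One hedged way to capture such profiles, assuming istiod exposes the standard Go pprof endpoints on its debug port 8080 (verify this for your version and installation), is:

         # Forward istiod's debug port locally (deployment name may differ per revision)
         kubectl -n istio-system port-forward deploy/istiod 8080:8080 &

         # 30-second CPU profile
         curl -s "http://localhost:8080/debug/pprof/profile?seconds=30" -o istiod-cpu.pprof

         # Heap (memory) profile
         curl -s "http://localhost:8080/debug/pprof/heap" -o istiod-heap.pprof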