Google Open Source Blog

The Kubernetes ecosystem is a candy store

Monday, June 3, 2024

For the 10th anniversary of Kubernetes, I wanted to look at the ecosystem we created together.

I recently wrote about the pervasiveness and magnitude of the Kubernetes and CNCF ecosystem. This was the result of a deliberate flywheel. This is a diagram I used several years ago:

Flywheel diagram of Kubernetes and CNCF ecosystem

Because Kubernetes runs on public clouds, private clouds, at the edge, and elsewhere, it is an attractive target for developers and vendors building solutions for its users. Most tools built for or integrated with Kubernetes work across all those environments, whereas integrating directly with cloud providers entails separate work for each one. Thus, Kubernetes created a large addressable market with a comparatively low cost to build.

We also deliberately encouraged open source contribution, to Kubernetes and to other projects. Many tools in the ecosystem, not just those in CNCF, are open source. This includes many tools built by Kubernetes users, tools built by vendors that were too small to be products, and tools intended to be the cores of products. Developers built and/or wrote about solutions to problems they experienced or saw, and shared them with the community. This made Kubernetes more usable and more visible, which likely attracted more users.

Today, the result is that if you need a tool, extension, or off-the-shelf component for pretty much anything, you can probably find one compatible with Kubernetes rather than having to build it yourself, and you are more likely to find one that works out of the box with Kubernetes than with your cloud provider. Often there are several options to choose from. I’ll just mention a few. I also want to give a shout-out to Kubetools, whose great list of Kubernetes tools helped me discover a few new ones.

For example, if you’re an application developer whose application runs on Kubernetes, you can build and deploy it with Skaffold, test it locally on Kubernetes with Minikube, connect to Kubernetes remotely with Telepresence, or sync to a preview environment with Gitpod or Okteto. When you need to debug multiple instances, you can use kubetail to view their logs in real time.

To deploy to production, you can use GitOps tools like FluxCD, ArgoCD, or Google Cloud’s Config Sync. You can perform database migrations with Schemahero. To aggregate logs from your production deployments, you can use fluentbit. To monitor them, you have your pick of observability tools, including Prometheus, which was inspired by Google’s Borgmon tool, much as Kubernetes was inspired by Borg, and which was the second project accepted into the CNCF.

If your application needs to receive traffic from the Internet, you can use one of the many Ingress controllers or Gateway implementations to configure HTTPS routing, and cert-manager to obtain and renew the certificates. For mutual TLS and advanced routing, you can use a service mesh like Istio, and take advantage of it for progressive delivery using tools like Flagger.

If you have a more specialized type of workload to run, you can run event-driven workloads using Knative, batch workloads using Kueue, ML workflows using Kubeflow, and Kafka using Strimzi.

If you’re responsible for operating Kubernetes workloads, there’s kubecost for monitoring costs. To enforce policy constraints, there are OPA Gatekeeper and Kyverno. For disaster recovery, you can use Velero. To debug permissions issues, there are RBAC tools. And, of course, there are AI-powered assistants.

You can also manage infrastructure using Kubernetes, with tools such as Config Connector or Crossplane, so you don’t need to learn a different syntax and toolchain to do that.

There are tools with a retro experience like K9s and Ktop, fun tools like xlskubectl, and tools that are both retro and fun like Kubeinvaders.

If this makes you interested in migrating to Kubernetes, you can use a tool like move2kube or kompose.

This just scratches the surface of the great tools available for Kubernetes. I view the ecosystem more as a candy store than as a hellscape. It can take time to discover, learn, and test these tools, but overall I believe they make the Kubernetes ecosystem more productive. Developing any one of these tools yourself would require a significant time investment.

I expect new tools to continue to emerge as the use cases for Kubernetes evolve and expand. I can’t wait to see what people come up with.

By Brian Grant, Distinguished Engineer, Google Cloud Developer Experience

Anomaly detection with few labeled samples under distribution mismatch

Thursday, May 30, 2024

SPADE: Semi-Supervised Anomaly Detection under Distribution Mismatch

What is SPADE?

Recently, we have open-sourced SPADE (Semi-supervised Pseudo-labeler Anomaly Detection with Ensembling), a semi-supervised framework for anomaly detection that overcomes some of the drawbacks of alternative anomaly detection methods.

What Problem does SPADE Solve?

Anomaly detection is the process of identifying samples in a dataset that diverge from some expected pattern. It has wide applications across industries, such as API security, financial fraud detection, and manufacturing defect detection. SPADE is designed specifically for semi-supervised settings where we have a handful of labeled samples and a large number of unlabeled samples.

When is SPADE better for your Use Case?

Creating a large labeled set of anomalous and non-anomalous samples for supervised learning can be time-consuming, expensive and error-prone. So unsupervised and semi-supervised methods have become an active area of research.

Most of these semi-supervised methods make the assumption that the labeled and unlabeled data come from the same distribution, that is, they are generated by the same underlying process—physical, financial, manufacturing or other process. This assumption is often violated in different ways—the labeled data could contain one type of anomaly while the unlabeled data contains other types of anomalies; or the labeled data could only contain samples that were easy to label. In these and potentially other cases, SPADE has been shown to have better performance than alternatives.

How does it Work?

SPADE constructs an ensemble of One-Class Classifiers (OCCs); each OCC is a Gaussian Mixture Model trained in a self-supervised manner on a disjoint subset of the unlabeled samples and non-anomalous samples.
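To make the ensemble construction concrete, here is a minimal sketch under loud assumptions: it is not the SPADE code. A single 1-D Gaussian (mean and standard deviation) stands in for the Gaussian Mixture Model, the data is made up, and the helper names `split_disjoint`, `fit_occ`, and `anomaly_score` are hypothetical. It also assumes one reading of the sentence above, namely that each OCC sees its own disjoint slice of the unlabeled data plus the labeled non-anomalous samples.

```python
import random
import statistics


def split_disjoint(samples, k, seed=0):
    """Shuffle the samples and deal them into k disjoint subsets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def fit_occ(values):
    """Toy one-class model: the mean and standard deviation of 1-D data.

    This single Gaussian is a stand-in for the Gaussian Mixture Model.
    """
    return statistics.fmean(values), statistics.pstdev(values)


def anomaly_score(occ, x):
    """Distance from the mean in standard deviations (higher is more anomalous)."""
    mean, std = occ
    return abs(x - mean) / std if std > 0 else 0.0


# Hypothetical data: 30 ordinary unlabeled points, one unlabeled outlier,
# and three labeled non-anomalous points.
unlabeled = [0.1 * i for i in range(30)] + [50.0]
labeled_normal = [1.0, 1.5, 2.0]

subsets = split_disjoint(unlabeled, k=3)
# Assumption: each OCC is fit on its own subset plus the labeled normal samples.
ensemble = [fit_occ(subset + labeled_normal) for subset in subsets]
```

Each member of `ensemble` can then score any sample with `anomaly_score`, and because the subsets do not overlap, the members see different views of the unlabeled data.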

Figure 1. SPADE first trains an ensemble of OCCs to provide pseudo-labels for the unlabeled samples. Then, both labeled and pseudo-labeled samples are used to train a supervised model for anomaly detection.

The ensemble is used to obtain pseudo-labels for the unlabeled data. A pseudo-label of is-anomalous or not-anomalous is assigned only if all the members of the ensemble agree. The pseudo-labels and any original labels are used together to train a supervised anomaly detector model. In the version of SPADE that we are open-sourcing, this model is a TensorFlow Random Forest that is trained with a binary cross-entropy loss. Once trained on the labels and pseudo-labels, the detector model can be used for online or batch prediction.
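The agreement rule can be illustrated with a few lines of Python. This is a sketch of the unanimity idea only, not code from the SPADE repository; the function name `pseudo_label` and the example votes are hypothetical, and each vote is assumed to be a boolean where `True` means "anomalous".

```python
def pseudo_label(votes):
    """Return 1 if every ensemble member votes anomalous, 0 if every member
    votes non-anomalous, and None (sample stays unlabeled) on disagreement."""
    if not votes:
        raise ValueError("at least one ensemble vote is required")
    if all(votes):
        return 1
    if not any(votes):
        return 0
    return None


# One row per unlabeled sample, one boolean vote per ensemble member.
ensemble_votes = [
    [True, True, True],      # unanimous: anomalous
    [False, False, False],   # unanimous: not anomalous
    [True, False, True],     # disagreement: no pseudo-label
]

labels = [pseudo_label(votes) for votes in ensemble_votes]

# Only samples with a pseudo-label would join the labeled data for supervised training.
pseudo_labeled = [(index, label) for index, label in enumerate(labels) if label is not None]
```

Requiring unanimity trades coverage for precision: fewer samples receive pseudo-labels, but the ones that do are less likely to be mislabeled.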

Example Use Cases

The benefits of SPADE described above are highlighted in the experiments detailed in our published paper (in TMLR with Featured Certification). Here we present results on a selection of datasets that demonstrate SPADE performance when (a) there are new types of anomalies in the unlabeled dataset, (b) the labeled anomalies are easy to label, and (c) the dataset contains only positively labeled and unlabeled samples.

Figure 2. SPADE performance compared against other supervised, semi-supervised and unsupervised methods. Details about the datasets and the methods can be found in our paper.

As shown in Figure 2, SPADE consistently outperforms alternative methods. The CoverType and Thyroid datasets have Creative Commons Attribution 4.0 International (CC BY 4.0) licenses and are present in the SPADE repository.

How to use SPADE

We have just open-sourced SPADE. The repository contains scripts that build and push a Docker container, then run it as a Vertex Custom Job on Google Cloud Platform. The dataset is read from BigQuery. Metrics such as AUC, precision, and recall can currently be tracked in the job logs. The job launch script is configured with a default set of hyperparameters, as described in the documentation. Users may need to adjust the hyperparameters to obtain optimal performance. The final trained anomaly detection model artifact is written to Google Cloud Storage (GCS). This artifact can be deployed as a Vertex Endpoint to serve predictions (not demonstrated in this repository).

Ways to Help

By open sourcing SPADE, we hope to foster more usage of this innovative anomaly detection method in the community, as well as invite contributions to improve the method. The SPADE model and code are freely available on GitHub under the Apache-2.0 license. SPADE is currently set up to run in a Docker container as a Vertex Custom Job on Google Cloud Platform. It can also be installed from PyPI using pip install spade-anomaly-detection. Users can upload their dataset to BigQuery and run the training job on Vertex, or on a local machine from the PyPI installation.

More detailed usage instructions are available in the documentation.

By Raj Sinha and Jinsung Yoon, Cloud AI Research Team

Kubernetes 1.30 is now available in GKE in record time

Friday, May 10, 2024

Kubernetes 1.30 is now available in the Google Kubernetes Engine (GKE) Rapid Channel less than 20 days after the OSS release! For more information about the content of Kubernetes 1.30, read the Kubernetes 1.30 Release Notes and the specific GKE 1.30 Release Notes.

Control Plane Improvements

We're excited to announce that ValidatingAdmissionPolicy graduates to GA in 1.30. This feature enables many admission webhooks to be replaced with policies defined using the Common Expression Language (CEL) and evaluated directly in the kube-apiserver. It benefits both extension authors and cluster administrators by dramatically simplifying the development and operation of admission extensions. Many existing webhooks may be migrated to validating admission policies. For webhooks not ready or able to migrate, Match Conditions may be added to webhook configurations using CEL rules to pre-filter requests and reduce webhook invocations.
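As a rough illustration of what such a policy looks like, the sketch below builds a ValidatingAdmissionPolicy and its binding as Python dictionaries and prints them as JSON, which kubectl accepts alongside YAML. The object names, the replica limit, and the `environment: test` namespace label are made up for the example; the field names follow the `admissionregistration.k8s.io/v1` API as we understand it, so check them against the API reference before use.

```python
import json

# Hypothetical policy: reject Deployments with more than 5 replicas.
policy = {
    "apiVersion": "admissionregistration.k8s.io/v1",
    "kind": "ValidatingAdmissionPolicy",
    "metadata": {"name": "replica-limit.example.com"},
    "spec": {
        "failurePolicy": "Fail",
        "matchConstraints": {
            "resourceRules": [{
                "apiGroups": ["apps"],
                "apiVersions": ["v1"],
                "operations": ["CREATE", "UPDATE"],
                "resources": ["deployments"],
            }]
        },
        "validations": [{
            # CEL expression evaluated inside the kube-apiserver.
            "expression": "object.spec.replicas <= 5",
            "message": "Deployments may have at most 5 replicas.",
        }],
    },
}

# A policy has no effect until a binding selects where it applies.
binding = {
    "apiVersion": "admissionregistration.k8s.io/v1",
    "kind": "ValidatingAdmissionPolicyBinding",
    "metadata": {"name": "replica-limit-binding.example.com"},
    "spec": {
        "policyName": policy["metadata"]["name"],
        "validationActions": ["Deny"],
        "matchResources": {
            "namespaceSelector": {"matchLabels": {"environment": "test"}}
        },
    },
}

manifest = json.dumps(
    {"apiVersion": "v1", "kind": "List", "items": [policy, binding]},
    indent=2,
)
print(manifest)
```

Saving the printed JSON to a file and applying it with `kubectl apply -f` would be one way to try this on a test cluster.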

Validation Ratcheting makes CustomResourceDefinitions even safer and easier to manage. Prior to Kubernetes 1.30, when updating a custom resource, validation was required to pass for all fields, even fields not changed by the update. Now, with this feature, only fields changed in the custom resource by an update request must pass validation. This limits validation failures on update to the changed portion of the object, and reduces the risk of controllers getting stuck when a CustomResourceDefinition schema is changed, either accidentally or as part of an effort to increase the strictness of validation.
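The effect of ratcheting can be shown with a toy model. The sketch below is not the kube-apiserver implementation: it treats a custom resource as a flat dictionary, uses plain Python predicates in place of schema validation, and the function and field names are hypothetical.

```python
def validate_update(old, new, validators, ratcheting=True):
    """Return the names of the fields that block the update.

    validators maps a field name to a predicate that returns True for valid
    values. With ratcheting, a field that fails validation is tolerated as
    long as its value is unchanged from the stored object.
    """
    errors = []
    for field, is_valid in validators.items():
        value = new.get(field)
        if is_valid(value):
            continue
        if ratcheting and field in old and old[field] == value:
            continue  # unchanged and already invalid: does not block the update
        errors.append(field)
    return errors


# Hypothetical schema that was tightened after the object was stored.
validators = {
    "replicas": lambda v: isinstance(v, int) and v >= 1,
    "tier": lambda v: v in {"gold", "silver"},
}
stored = {"replicas": 3, "tier": "bronze"}   # "bronze" was valid under the old schema
update = {"replicas": 5, "tier": "bronze"}   # only replicas changes

without_ratcheting = validate_update(stored, update, validators, ratcheting=False)
with_ratcheting = validate_update(stored, update, validators, ratcheting=True)
```

Without ratcheting, the untouched `tier` field blocks the update; with ratcheting, the update goes through, while changing `tier` to another invalid value would still be rejected.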

Aggregated Discovery graduates to GA in 1.30, dramatically improving the performance of clients, particularly kubectl, when fetching the API information needed for many common operations. Aggregated discovery reduces the fetch to a single request and allows caches to be kept up-to-date by offering ETags that clients can use to efficiently poll the server for changes.
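The ETag-based polling pattern can be sketched without a cluster. The example below simulates it in memory: `FakeDiscoveryServer` and `CachingClient` are hypothetical stand-ins, not Kubernetes or client-go APIs, and the ETag is simply a hash of the response body.

```python
import hashlib
import json


class FakeDiscoveryServer:
    """In-memory stand-in for a server exposing one aggregated discovery document."""

    def __init__(self, groups):
        self.groups = groups

    def get(self, if_none_match=None):
        body = json.dumps(self.groups, sort_keys=True)
        etag = hashlib.sha256(body.encode()).hexdigest()
        if if_none_match == etag:
            return 304, etag, None   # Not Modified: no body is sent
        return 200, etag, body


class CachingClient:
    """Keeps the last document and ETag, and revalidates on every poll."""

    def __init__(self, server):
        self.server = server
        self.etag = None
        self.cached = None
        self.downloads = 0

    def discover(self):
        status, etag, body = self.server.get(if_none_match=self.etag)
        if status == 200:
            self.etag = etag
            self.cached = json.loads(body)
            self.downloads += 1
        return self.cached


server = FakeDiscoveryServer({"apps": ["v1"], "batch": ["v1"]})
client = CachingClient(server)

client.discover()                              # first poll downloads the document
client.discover()                              # unchanged: answered with 304
server.groups["example.com"] = ["v1alpha1"]    # a new API group appears
client.discover()                              # ETag no longer matches: downloaded again
```

After the three polls, the client has downloaded the full document only twice, which is the saving the ETag provides when nothing has changed.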

Data Plane Improvements

Dynamic Resource Allocation (DRA) is an alpha Kubernetes feature added in 1.26 that enables flexibility in configuring, selecting, and allocating specialized devices for pods. Feedback from SIG Scheduling and SIG Autoscaling revealed that the design needed revisions to reduce scheduling latency and fragility, and to support cluster autoscaling. In 1.30, the community introduced a new alpha design, DRA Structured Parameters, which takes the first step towards these goals. This is still an alpha feature with a lot of changes expected in upcoming releases. The newly formed WG Device Management has a charter to improve device support in Kubernetes - with a focus on GPUs and similar hardware - and DRA is a key component of that support. Expect further enhancements to the design in another alpha in 1.31. The working group has a goal of releasing some aspects to beta in 1.32.

Kubernetes continues the effort to eliminate perma-beta features: functionality that has long been used in production but still isn’t marked as generally available. In this release, AppArmor support got some attention and moved closer to finally being marked as GA.

There are also quality-of-life improvements in the Kubernetes data plane. Many of them are noticeable only to system administrators and not particularly helpful for GKE users. In this release, however, the notable Sleep Action KEP entered beta and is available on GKE. It will now be easier to use slim images while still allowing graceful connection draining, specifically for some flavors of nginx images.
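For a sense of how the sleep action is used, the sketch below builds a Pod manifest as a Python dictionary and prints it as JSON. The Pod name, image, and the 5-second delay are illustrative choices; the `lifecycle.preStop.sleep.seconds` field is the one the Sleep Action KEP adds, as we understand it. The appeal for slim images is that the kubelet performs the pause itself, so the container presumably does not need to ship a `sleep` binary for an exec-style hook.

```python
import json

# Hypothetical Pod that pauses for 5 seconds before its container is stopped,
# giving in-flight connections time to drain.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "sleep-hook-demo"},
    "spec": {
        # The grace period should be longer than the preStop sleep.
        "terminationGracePeriodSeconds": 30,
        "containers": [{
            "name": "web",
            "image": "nginx",
            "lifecycle": {
                "preStop": {"sleep": {"seconds": 5}},
            },
        }],
    },
}

manifest = json.dumps(pod, indent=2)
print(manifest)
```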

Acknowledgements

We want to thank all the Googlers who provide their time, passion, talent, and leadership to keep making Kubernetes the best container orchestration platform. For the features mentioned in this post, we would especially like to thank Googlers Cici Huang, Joe Betz, Jiahui Feng, Alex Zielenski, Jeffrey Ying, John Belamaric, Tim Hockin, Aldo Culquicondor, Jordan Liggitt, Kuba Tużnik, Sergey Kanzhelev, and Tim Allclair.

Posted by Federico Bongiovanni – Google Kubernetes Engine