
KEP-3698: Multi-Network #3700

Open · wants to merge 2 commits into base: master

Conversation

@mskrocki (Contributor) commented Dec 21, 2022

  • One-line PR description: Multi-Network

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 21, 2022
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/network Categorizes an issue or PR as relevant to SIG Network. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 21, 2022
a network
* Identify what “Object” is the Primary network
* Optional parameters: MAC address, IP address, speed, MTU, interface name etc.
14. Every Pod connected to specific network (represented by the “Object”) must

Is this a necessary limitation for all "Objects"? Perhaps a network is meant only for north-south traffic. I wonder if this could be left to be defined by the implementation, similarly to the next point.

Contributor Author

I think that would be too open. Today's K8s network has exactly this requirement; this way we standardize across all the networks.

Do you have a use case for such a situation?

the “Object”)
15. Pods attached to a network are connected to each other in a manner defined
by the “Object” implementation
16. Basic network Interface information for each attachment will be exposed to

Is there a comparable downward API today? If not, I wonder if this could be deferred to a later phase. I'm concerned about implications of hot-plug or dynamic IPAM on this. It would also make this first phase a little leaner.

Contributor Author

I don't think there is one. Fair point on maybe moving it to a later phase. What is your concern with regard to hot-plug or dynamic IPAM?

16. Basic network Interface information for each attachment will be exposed to
runtime Pod (via e.g. environment variables, downward API etc.)

#### Phase II (scheduler, kubelet and API probing)

Would QoS and resource handling fit here? Since this phase is concerned with the scheduler, maybe it should cover resource management too - if network were in the same category as CPU or memory, it would suggest using the same scheduling/pooling mechanism.

Contributor Author

I think QoS and "network-resourcing" as a whole would be a new requirement. I would consider this a future extension.


Shouldn't this be "implementation specific"? aka something your CNI plugin(s) deal with?

@maiqueb left a comment

Thanks for this proposal @mskrocki .

during cluster creation that is available to the Pod when no additional
networking configuration is provided.
* **Primary network** - This is the network inside the Pod which interface is
used for the default gateway.
Contributor

In the meetings I've been to, people keep getting confused about these two. It certainly doesn't help that the definition of "primary network" pretty much has to use the word "default".

You could call the first one the "Cluster-wide network"? Or "Traditional Kubernetes network"?

Contributor Author

The additional networks can be "cluster-wide" and/or "Traditional Kubernetes networks". Naming is always tricky, do you have other recommendations?


I personally think you cannot avoid the "default" term in both scenarios due to historical semantics of networking. The "default network" could be called the "default instance" or another term based on the naming of the object.
The "primary network" could be referred to as the "primary attachment", where the logic is to specify to which object the Pod's "primary" interface must be attached.

Comment on lines 257 to 258
18. Kubelet network-based probing is optional for the “Object” connections to
Pod
Contributor

Optional for who?

I think what you mean is "if a Pod is not attached to the default cluster-wide network, then it is not defined whether network-based probes to that pod will work or not".

Contributor Author

refactored.

Comment on lines 259 to 261
19. “Object” connections to Pod are optionally able to connect to Kubernetes
API - the Pods connections via non-default Pod network does not require access
to Kubernetes API
Contributor

What are '"Object" connections to Pod'? I'm imagining an object in etcd calling connect() somehow...

I guess you mean "Pods are not guaranteed to be able to reach the Kubernetes API via networks other than the cluster-default network" ?

Contributor Author

refactored.

Comment on lines 262 to 264
20. Kubernetes API can optionally reach to “Object” connections to
Pod - Kubernetes API access to the Pod via non-default Pod network is not
required
Contributor

"The Kubernetes apiserver is not guaranteed to be able to reach pods via networks other than the cluster-default network"

Contributor Author

refactored.

20. Kubernetes API can optionally reach to “Object” connections to
Pod - Kubernetes API access to the Pod via non-default Pod network is not
required

Contributor

Phase II doesn't make sense as a separate phase. Each of the requirements 17-20 says "you don't have to do X". If Phase II is something above and beyond Phase I, then does that imply that in Phase I, you do have to do all those things? Eg, in Phase I, all networks must be available on all nodes, network probes have to work on all networks, etc, and then in Phase II we make it all optional?

I think Phase II needs to just be squashed into Phase I.

Unless everything in Phase II is kind of the opposite of what it says, and you really mean "in Phase I, network probes definitely can't be used on non-default networks but then in Phase II we will define an optional way to make it work". In which case, that needs to be more clear.

Contributor Author

It is the latter, where we would add support for such capabilities. I have reworded the requirements to better reflect this.

Comment on lines 267 to 268
21. “Object” connections to Pod are optionally able to provide Service,
NetworkPolicies functionality
Contributor

"Connections to Pods over secondary networks" or whatever. '"Object" connections to Pod' is just weird.

Contributor Author

refactored.

Comment on lines 238 to 239
12. The Pod reference to a Networks is optional and when NOT specified, Pod
connects to “Default” “Object” (network the cluster has been created with)


Based on the discussion (#3700 (review)), a Pod may NOT connect to the "Default" network and may instead connect to a "namespace-default" Network.

This may change current behavior and violate the first requirement ("1. This effort shall not change the behavior of today's existing clusters"), because an existing cluster connects all Pods to the cluster network (which I believe is the "Default" network in your KEP).

So could you please refine them to be consistent? (Change requirement 12 and the "namespace-default" network, or just remove requirement 1.)

Contributor Author

How the override of the namespace-default network will work will be defined in the future. This requirement is about the default behavior, to explicitly satisfy requirement 1.

@squeed commented Jan 11, 2023

This all seems a bit too abstract. KEPs traditionally propose either some Kubernetes functionality or an API object. It reads like the charter to a working group rather than a KEP.

Is there a way this can be closer to a specific proposal? I'd like to see the API types :-).

@mccv1r0 commented Jan 11, 2023

When the dust settles we need to be able to have a k8s pod do the equivalent of what docker/podman can do today. (Apologies for those on the calls who already heard all this.)

For example, using podman, since it supports the same CNI as k8s, I can use:

sudo podman run --privileged --detach --name=exampleOne --network=public,dmz,storage,private quay.io/nginx:latest

This simple example puts 4 interfaces in the container. What I want to do with them doesn't need to be a concern of the apiserver. In each case I pick plugins, and the .conflist configures them to solve my problem. What I use and how they are configured are implementation specific.

As discussed, we will need a way to pass plugin-specific info (static IP, static MAC, etc.) to each network/.conflist. Most/all use cases I've seen in the doc can be demonstrated with docker/podman.

for performance purposes (high bandwidth, low latency), that my user-space
application (e.g. DPDK-based) can use. The VF will not use the standard netdev
kernel module. The Pod’s scheduling to the nodes should be based on hardware
availability (e.g. devicePlugin or some other way).


Usually, when you dig into this, what you discover is that you have a workload that requires some sort of special treatment from the physical network and a fast path to it.

The problem is that this requirement is entirely mis-focused. Literally nobody wants an SRIOV VF. What they want is a fast path to the physical network doing a specific set of things for them.

You can tell this because

  1. If you offer someone a VF that leads to the physical network mishandling the packets pushed into the VF, they will be very unhappy.
  2. If you can't offer them an SRIOV VF because your hardware only supports SIOV (the next technology coming after SRIOV), they will be very unhappy (even if you could have given them a vfio for the SIOV that is connected to the proper physical network treatment).
  3. If you offer them vfio for the VF when they needed a kernel interface, they will be unhappy.

All of this becomes even more crucial when you are scheduling the Pod. If I schedule the Pod to a Node that just has SRIOV VFs available, but its NICs lack the required capabilities, or cannot provide a fast path to the required sort of treatment, this all breaks down.

There are important sets of requirements in here... but I would recommend thinking them through outside of the 'interface' 'network' paradigm being applied here to get to the real root of the need.

I'd suggest rethinking the requirement in terms of:

  1. What Network Service does the user want from the physical network?
  2. How do they want to consume it? Via what mechanism? (kernel interface, rdma, vfio, etc)
  3. What capabilities do they need from the fast path connection to that Network Service?

#### Story #3
I have implemented my Kubernetes cluster networking using a virtual switch. In
this implementation I am capable of creating isolated Networks. I need a means
to express to which Network my workloads connect to.


@danwinship Agreed. This requirement feels like it would be covered by a properly written 'traffic segregation' requirement.

As a Virtual Machine -based compute platform provider that I run on top of
Kubernetes and Kubevirt I require multi-tenancy. The isolation has to be
achieved on Layer-2 for security reasons.


Why does this have to be achieved at L2 for security reasons? That's an oddly specific requirement. What security reasons?

This is particularly worrisome as K8s carefully avoids having any L2 concepts in it. It would be unfortunate to re-introduce them for such sparse requirements.

Contributor Author

This use case is specific on purpose (to counter some of the above comments about use cases being too vague). This is NOT about adding L2/L3 notions into k8s; it's about handling such a use case. Check this section's top description: any technology or product mentioned here does NOT mean it will be supported, and this is reflected in the requirements below, none of which talks about supporting a specific layer or product.

As a platform operator I need to connect my on-premise networks to my workload
Pods. I need to have the ability to represent these networks in my Kubernetes
cluster in such a way that I can easily use them in my workloads.


Could we get more clarity on this case? There are lots of ways of achieving this already using BGP in many existing CNI plugins. Is this perhaps another traffic segregation case?

@dabernie Jan 26, 2023

Challenge: most existing CNI plugins do it via specific CRDs which do not necessarily coexist well. The purpose is to offer a solution where K8s could support such implementations without specific CRDs.

That way, the challenge of supporting workloads and avoiding the "platform certification" mess should become simpler.

“profiles” for Namespace, where I can not only change default Network,
but define a set of Networks assigned to given Namespace that Pods created in
that NS are automatically attached to.


OK... so there's a desire here to isolate workloads (traffic segregation), this time by namespace. It sounds like in this case there's a desire to have a segregated primary network per namespace, but it's unclear to me from the requirement as written why this would imply a 'set of Networks'.

It feels like this may be smashing two requirements together:

  1. A desire to have per-namespace traffic segmentation for primary networks. If so, this opens interesting questions about how that interacts with the rest of K8s networking.
  2. A desire to add a set of networks by namespace to a Pod... but it's unclear why in this requirement.

Comment on lines 219 to 220
2. We need to introduce an “Object” that represents the existing
infrastructure’s networking


Question: Why? Why do we need to introduce an Object here? This feels more like the 'solution' side of things than the 'requirement' side of things.

As engineers, we all have a serious tendency to decide on a solution and then decide that solution is a requirement. We all do it (I more than most). This screams of that here. Let's get at the real requirements underpinning this proposed solution.

Comment on lines 223 to 224
4. The “Object” shall not define any implementation specific parameters in that
object


I am also confused.

Comment on lines 225 to 227
5. “Object” shall provide option to define:
* IPAM mode: external/internal
* List of route prefixes - optional and not forced on the implementations


In addition to @danwinship's points here (which I support)... how are we handling or preventing IPAM that conflicts with the Cluster/Service CIDRs, and routes that hijack Cluster/Service CIDRs?

Comment on lines 225 to 227
5. “Object” shall provide option to define:
* IPAM mode: external/internal
* List of route prefixes - optional and not forced on the implementations


One other point to make: by putting this 'Object' in the K8s API server, it is intrinsically moved out of the world of the people actually administering the network and into the world of the people administering the cluster. Generally speaking, in my experience this will work some of the time, break horribly some of the time, and work with a lot of pain for all involved some of the time, depending on the interactions between the K8s administrators and the folks administering the physical network involved.

Having the flexibility to move this point of control around is going to be crucial, and that argues against it necessarily being required to live in the K8s API server.

@shaneutt (Member)

/cc

for performance purposes (high bandwidth, low latency), that my user-space
application (e.g. DPDK-based) can use. The VF will not use the standard netdev
kernel module. The Pod’s scheduling to the nodes should be based on hardware
availability (e.g. devicePlugin or some other way).
Member

I agree with Dan that this is a particularly compelling user story. As for SRIOV VF I get that this was used as an example and so focusing in on it specifically might be distracting, but I also think it might be worth taking a couple seconds to add a couple more examples just to moot the point.

Comment on lines 204 to 205
2. We need an API object representing network, which a Pod, a Service or any
other feature that wishes to support Multi-Network capability could reference
@s1061123 May 9, 2023

I am still not quite clear why we need this object, as @edwarnicke adds in his comment.

Could you please elaborate to clarify:

  • What does the 'object' achieve for Kubernetes users?
  • Why is the 'object' required for users?
  • Without the 'object', what is not achieved?

Based on the current Kubernetes community situation, Multus, Network Service Mesh and other network components outside of the Kubernetes core do achieve multi-network without core network objects.

Contributor Author

Today's implementations create CRDs that have to be referenced via annotations or other indirect means. An annotation is not meant to define the configuration of an object; additionally, there is no validation of the user input.
On top of that, every implementation has its own CRDs and way of configuration, so there is no single standard.


Today's implementation are creating CRDs, that have to be referenced via annotations or other in-direct means.

Some network implementations use CRDs, but not all. For example, flannel does not use CRDs. In addition, each implementation has several CRDs to specify a network, but that depends on the implementation. Some implementations use a CRD for 'network', but some use CRDs for 'network IPAM' and 'network transport (i.e. SDN configuration)'.

Your design assumes that one network object is required for everyone, for each implementation, but I guess it actually is not. So please clarify and give a reason why we need the object. Currently the reason seems ambiguous.

Contributor Author

The main reason is to have a handle/representation of the "network" a Pod connects/attaches to. This does NOT exist today. This is mentioned in the Summary and Motivation; maybe it is not sufficient. I will add more there.


### Goals

Define user stories and requirements for the Multi-Network effort in Kubernetes.
@shaneutt (Member) May 10, 2023

At today's multi-network sync (May 10th, 2023 - notes available here) there was a conversation involving several members of the community regarding the desire for this KEP to better define standards for defining additional parameters of a network interface for a Pod (perhaps particularly to help support DRA and QoS configurations) which may be implementation-specific, or perhaps generic enough for all implementations as an explicit goal of this KEP.

Personally I agree with the high level goal, but I think we should avoid trying to put new things in core if possible for a variety of long term maintenance reasons. There are other takes on this which I invite conversation to 🧵 here about.

Ultimately, I would like to see that conversation continued and resolved before we merge this KEP and the goals here adjusted accordingly.

cc @marquiz (re #3004)

Contributor

Thanks @shaneutt. I need to read this KEP with thought, but ultimately I think that DRA (if any) would be in line with the requirements. The QoS resources KEP (#3004) is something that could possibly be used to cover some current usage scenarios involving e.g. annotations. But it's good to understand what it is and how it's different.

cc @pohly


My thoughts are that there are two places where we might have arbitrary config.

  • At the network level. For example, parameters specifying what VLAN ID this network has, or what IP address ranges it is allowed to use.
  • At the network interface level. For example, parameters specifying the QoS for a particular interface, or the allowed bandwidth, or the required failure domain to allow redundancy.

These are all going to be implementation dependent, and we shouldn't put them into core Kubernetes. The easiest way to do this is to have a CR with whatever specific implementation specific horrors are required in it, and allow the standard K8s objects to have a reference to that CR (with it being entirely up to the implementation what to expect in the CR and what to do with it).

There is already a reference to a CR in the PodNetwork resource. My takeaway from the discussion was that the PodNetworkAttachment resource probably needs exactly the same capability.

I'm open to some common fields making their way into the PodNetworkAttachment (say MAC address or even IP as well as interface name or similar), but I can see that turning into something pretty ugly pretty fast if we aren't careful, so shoving as much as possible into a CR seems safer. I'm also open to this being "something we add in a later phase, but agree we expect to do later". Finally, I think the alternative is some kind of annotation model, where we optionally list interfaces and references to their config in annotations in the pod, which is plausible; a bit ugly but also less intrusive to core K8s.
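A minimal sketch of the CR-reference idea being discussed (type and field names here are assumptions for illustration, not the KEP's agreed API):

```go
package example

// ParametersRef is a hypothetical typed pointer to an implementation-owned
// custom resource, keeping implementation-specific settings out of core types.
type ParametersRef struct {
	Group     string `json:"group"`               // API group of the implementation's CRD
	Kind      string `json:"kind"`                // Kind of the implementation's CRD
	Name      string `json:"name"`                // name of the concrete parameters object
	Namespace string `json:"namespace,omitempty"` // optional, for namespaced parameters
}

// PodNetworkAttachmentSpec sketches how an attachment could carry only generic
// fields plus an opaque reference to the implementation's own configuration
// (QoS, bandwidth, failure domain, and similar).
type PodNetworkAttachmentSpec struct {
	PodNetworkName string         `json:"podNetworkName"`
	ParametersRef  *ParametersRef `json:"parametersRef,omitempty"`
}
```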

Member

As this is a requirements KEP, we should try to avoid blocking it on other KEPs unless we think it is fundamentally a non-starter for the community. The cost paid for work on refining requirements is community time rather than long-term tech debt.

* **Cluster Default PodNetwork** - This is the initial cluster-wide PodNetwork
provided during cluster creation that is available to the Pod when no
additional networking configuration is provided in Pod spec.
* **Primary PodNetwork** - This is the PodNetwork inside the Pod which interface
Member

We should be specific about the cardinality relationship between PodNetwork and interfaces. Is it always 1-1? If not, we will need to be very careful whenever one is mentioned vs another.


I assume we will probably end up with either a scope field in the API or different APIs (Kinds) for each of these use cases? Wondering if the following nomenclature pattern is more intuitive.

  1. Cluster Default PodNetwork
  2. Namespace Default PodNetwork (For future expansion, note Default is not the name of the namespace, this is the default network associated with a namespace)
  3. Default PodNetwork (instead of primary network, this is the one assigned to the pod, either by policy or by the user picking one).

Contributor Author

@bowei The cardinality relationship is going to be defined in Phase I. I don't want to mention that in this phase.

@pvenu-Google using "Default PodNetwork" when we already have "Cluster Default PodNetwork" will not be distinguishable; most folks will keep skipping the "Cluster" word, which will lead to confusion. And we want to keep the "Default" PodNetwork as it is now, to align with the "default" namespace.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mskrocki
Once this PR has been reviewed and has the lgtm label, please assign thockin for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Comment on lines +249 to +250
* We will introduce an PodNetwork API object that represents the existing
infrastructure’s networking.
Member

How do you solve the bootstrap problem? Nodes are created and then Pods; when is the PodNetwork object created, and by whom?
What is the lifecycle of the PodNetwork object?
How does it interact with the lifecycle of the Node and Pod objects?
How is the infrastructure represented?
How does it handle network partitions in the infrastructure? What happens if Node A can talk with Node B but not vice versa?

Contributor Author

Most of these are answered below. Regarding

How does it handle network partitions in the infrastructure? What happens if Node A can talk with Node B but not vice versa?

This KEP will not define/explain any infrastructure relations. The PodNetwork is just an abstraction for that, and it is up to the implementation how strongly it is coupled with the infra.

@uablrek commented Feb 17, 2024

Why all the use-cases?

As I see it, all of them are implementation specific. For the API extension, which the KEP is about, it would be sufficient to have a use-case #0, like:

As an advanced network user I would like to add various network interfaces to various PODs for various reasons

@uablrek commented Feb 17, 2024

I'm missing a detailed description of the clean-up of all PodNetwork attachments in a Pod when it dies. In my experience it is always much, much harder to delete things than to add them. As of now the CRI plugin will call the CNI plugin with a DEL command. With Multus, it is Multus's responsibility to do a DEL on all its sub-CNI-plugins.

But who will do this in K8s multi-networking? Kubelet? If so, will KNI become a prerequisite?

Will this part be skipped? If so, it will be a requirement on implementations to garbage-collect resources such as IP addresses.

@danwinship (Contributor) left a comment

Yeah, I think the big picture here is correct.

interfaces
* applications leveraging performance-oriented interfaces (e.g. `AF_XDP`,
`memif`, `SR-IOV`)
* applications requiring support for protocols not yet supported by Kubernetes
Contributor

I'm not sure what this means... If you only care about supporting the protocol on the primary network interface then isn't it just a matter of selecting a primary network plugin that supports that protocol? (I'm not sure what level of "protocol" you're referring to, but SCTP and multicast are examples here: k8s conformance doesn't require either of those, and many plugins don't allow them, but some do.)

Contributor Author

I do mean networking protocols, e.g. multicast as you mentioned. Agreed that you could change the primary CNI and support any protocol, but what if I don't want to touch my default Network configuration but still add support for the new protocol? With MN I can just create a new PodNetwork that uses the additional/new CNI, and only for those Pods that need it.

Comment on lines +111 to +113
* Introduce API object to kubernetes describing networks Pod can attach to.
* Evolve current Kubernetes networking model to support multiple networks,
defining the new model in a backwards compatible way.
Contributor

I think there's increasing consensus that we ought to have a better network model, and objects that describe networks, even in the single-network case (eg "KNI").

I'm not sure what the right way to talk about this in the KEP is, since KNI is even more pre-alpha than Multi-Network. But I think it's important to be thinking of this as not just about multiple networks, but also about having a better (and more k8s-native) model of pod networking in general, which in particular supports multiple networks.


### Non-Goals

Define the CNI implementation.
Contributor

Not exactly sure what you mean here (even after finishing reading the KEP).

Contributor Author

I meant to say here that this KEP does not specify how to implement the CNI, but just provides an API abstraction/handle.

for performance purposes (high bandwidth, low latency), that my user-space
application (e.g. DPDK-based) can use. The VF will not use the standard netdev
kernel module. The Pod’s scheduling to the nodes should be based on hardware
availability (e.g. devicePlugin or some other way). This interface might not
Contributor

Not clear if you mean "k8s will implicitly use a devicePlugin or something to handle scheduling" or "the user will need to explicitly think about devicePlugins to get scheduling to work right".

(The latter is how it works in Multus; when using SR-IOV, it's not enough to just request the SR-IOV NetworkAttachmentDefinition, you have to also request an SR-IOV resource. But this is annoying and makes SR-IOV NADs work differently from other kinds of NADs (eg, ipvlan-based). Of course, Multus can't make scheduling happen automatically because the scheduler doesn't know anything about Multus, but a k8s-native API ought to be able to do better here.)

Member

This exact functionality is described in DRA (https://docs.google.com/document/d/1XNkTobkyz-MyXhidhTp5RfbMsM-uRCWDoflUMqNcYTk/edit?disco=AAAA9O0je4I). I don't think we should duplicate functionality; we need to understand what is missing in DRA (which is being revisited these days) and suggest the necessary changes if we can reuse that mechanism for this.

// Examples: eth1 or net1
//
// +optional
InterfaceName string `json:"interfaceName,omitempty"`
Contributor

Why does it make sense to specify this here rather than as a ParametersRef? Particularly given that it's not even meaningful for some network types (eg, the bandwidth user story).

Contributor Author

This is to ensure the name is unique across the list of Networks specified. Otherwise there is no control, especially if I attach 2 PodNetworks from 2 different vendors.
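For illustration, a minimal sketch of the uniqueness check this field enables (type and field names are assumptions, not the KEP's final API):

```go
package validation

import "fmt"

// PodNetworkRef is a stand-in for a Pod's per-network entry; the real KEP type
// and field names may differ.
type PodNetworkRef struct {
	PodNetworkName string
	InterfaceName  string
}

// validateInterfaceNames rejects Pod specs where two attachments request the
// same in-Pod interface name, e.g. two vendors both asking for "net1".
func validateInterfaceNames(refs []PodNetworkRef) error {
	seen := map[string]bool{}
	for _, r := range refs {
		if r.InterfaceName == "" {
			continue // unnamed entries get an implementation-chosen name
		}
		if seen[r.InterfaceName] {
			return fmt.Errorf("duplicate interface name %q", r.InterfaceName)
		}
		seen[r.InterfaceName] = true
	}
	return nil
}
```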

// for it.
//
// +optional
IsDefaultGW4 bool `json:"isDefaultGW4,omitempty"`
Contributor

Likewise, why does it make sense to specify this here rather than as a ParametersRef? It seems unlikely that for a given PodNetwork, some pods would want to have it as their default route and others would not. (Though of course, if that were the case, they could just use different PodNetworkAttachments.)

in particular, having separate IPv4 and IPv6 default routes is a really niche use case

(as with IPAM, I feel like the eventual solution here is a standard parameter type)

Contributor Author

Same as with interfaceName: how do I coordinate 2 PodNetworks that both indicate IsDefault?

Agreed on the v4/v6 split; let me remove that, and if need be we can make it more complex in the future.
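The same kind of check could address the coordination question above; a sketch with assumed type and field names, not the KEP's final API:

```go
package validation

import "fmt"

// GatewayRef is a stand-in for a Pod's per-network entry; real names may differ.
type GatewayRef struct {
	PodNetworkName   string
	IsDefaultGateway bool
}

// validateSingleDefaultGateway ensures at most one attachment on a Pod asks to
// carry the default route, so two PodNetworks cannot silently conflict.
func validateSingleDefaultGateway(refs []GatewayRef) error {
	chosen := ""
	for _, r := range refs {
		if !r.IsDefaultGateway {
			continue
		}
		if chosen != "" {
			return fmt.Errorf("both %q and %q request the default gateway", chosen, r.PodNetworkName)
		}
		chosen = r.PodNetworkName
	}
	return nil
}
```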

Comment on lines +519 to +524
We have considered using classes (similar to GatewayClass etc.), but we do not
expect any extra action for a PodNetwork to take place for specific implementation.
PodNetwork already points to a custom resource, which implementation can hook on
for any specific extra configuration. At this point of time we consider using
classes as overkill, and in future if such need arises, we can reuse the provider
field for that purpose.
Contributor

So, I think your model of how this works is that there's a single component in the cluster responsible for all PodNetworks.

But another model would be to allow multiple PodNetwork-managing components in the cluster, each supporting different functionality, and managing only the PodNetworks of their own PodNetworkClass. For example, there might be separate PodNetworkClasses implementing each of the User Stories (though there could also be "heavyweight" multi-use-case PodNetworkClasses like the implied universal PodNetworkClass currently envisioned by the KEP).

I think the PodNetworkClass model would lead to more reused code and more common functionality across cluster types. Given the model of the current KEP, it's possible (indeed, likely) that, say, GKE and OpenShift could both provide ways of implementing all of the User Stories, but those ways would be entirely incompatible, involving different Provider names and ParametersRefs of completely different CRD types. Whereas with the PodNetworkClass model, GKE and OpenShift might each have some PodNetworkClasses of their own, but could also make use of "generic" ones (possibly ones written by users themselves).
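A minimal sketch of the class indirection being suggested, modeled loosely on GatewayClass (all names here are assumptions, not part of the KEP):

```go
package example

// PodNetworkClass is a hypothetical class object naming the controller
// responsible for PodNetworks of that class, so several managing components
// can coexist in one cluster.
type PodNetworkClass struct {
	Name string
	// ControllerName identifies the component that reconciles PodNetworks
	// referencing this class, analogous to GatewayClass.spec.controllerName.
	ControllerName string
}

// PodNetworkSpec would then select its manager by class name instead of (or in
// addition to) a free-form provider string.
type PodNetworkSpec struct {
	ClassName string // which PodNetworkClass, and therefore which controller, owns this network
}
```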

Contributor Author

Multiple implementations in a cluster are supposed to be handled by the provider field, assuming the implementations support that field, and this makes interoperability dependent on the implementations themselves.

With a class this is enforced if we make that field mandatory. Will bring this up in the next sync meeting and talk it through.

#### Story #8
As Cluster operator I wish to manage my cluster’s bandwidth usage on a per-pod
basis, simultaneously preserving all other Pod networking configuration the same
across those Pods.
Contributor

The more I think about this, the less it seems to fit with this KEP.

Or at least, it doesn't fit as a PodNetwork. This seems more like a PodNetworkAttachment that points to the default PodNetwork but also includes a ParametersRef pointing to some well-known parameter type containing bandwidth information...

(And maybe that's actually what you meant? I guess it's not clear from the story itself...)

Member

@danwinship you have references to PodNetworkAttachment in different places, and if by that you mean the representation on a Pod of a network interface (https://docs.kernel.org/networking/netdevices.html), then this is the part I agree 100% we need to solve first.

And this is what DRA is actually solving (#3063); it seems the API is still being revisited, and based on my current reading it can perfectly accommodate the use cases of PodNetworkAttachment.

Once we have these capabilities we can use them to build the other high-level abstractions required in this KEP.

rule of: single Pod references a given PodNetwork only 1 time

#### Auto-population
When networks field is not set, and hostNetwork is not set, we will
Member

What happens with pods with hostNetwork: true? To which network do they belong?

Contributor Author

None, MN does not apply to HN pods.

Member

So how does this work then? If I have 3 pods, podA in networkA, podB in networkB and podC in the host network, which Pods are able to communicate with podC?

Contributor Author

Because today there is only 1 network, the root netns and the Pod netns are treated as one. We still need to assume that, and we call this the "default" Network; pods connected to that Network have to be backward compatible. Any other network's behavior is up to the implementation. For some, isolation is a factor; for some it is not. We cannot enforce that.

The above status is expected to be populated by kubelet, but this can only happen
after CRI provides support for the new Pod API. Because of that, initially
kubelet will behave as it does today, without updating the additional fields.
Until CRI catches up, the PodNetwork providers will be able to update that field
Member

We do not merge split functionality; CRI API changes have to be done at the same time.

Contributor Author

CRI changes have to come with container runtime changes. Considering all the functions of CRI take a Pod as an argument, the required changes are there from my point of view.

Member

Considering all the functions of CRI take a Pod as an argument, the required changes are there from my point of view.

Making changes to the Pod API does not mean runtimes will adopt these changes.

Contributor Author

And that is fine; I have an agent-based CNI that does not care about them.

Comment on lines +970 to +972
Considering all above changes are in the direct scope of CRI, this KEP will not
propose complete changes for them, and a separate KEP will be created to cover it.
Below are some suggestions on what the changes could look like.
@aojea (Member) Feb 25, 2024

If those changes are required for this KEP they must be in this KEP, or are you suggesting we merge some changes first and the others later?
Are the merged changes going to work without the other changes? We could end up merging something that is not functional.

Contributor Author

Yes, I would see us introducing this API as a first step. Then container runtimes catch up and adjust to the new fields Pod has. Otherwise the change just grows too large. Lastly, I can solve this today with an agent-based CNI that does not rely on CRI, until they catch up.

Member

Yes, I would see us introducing this API as a first step. Then container runtimes catch up and adjust to the new fields Pod has.

I think you need to provide more details on this; how long do you estimate it will take container runtimes to catch up on these features?

Contributor Author

IDK; considering the KNI proposal is on the table as well, maybe then it will be independent. My point is, I do not want to create a dependency on container runtimes.

Comment on lines 949 to 952
### Endpointslice controller changes
Considering changes to Pod.PodStatus.PodIPs list, we must ensure that the
controller is using the correct IPs when creating the endpoints. We will ensure
that only IPs for "defautl" PodNetwork will be used.
@aojea (Member) Feb 25, 2024

This requires a much deeper explanation, as this is the core functionality of Services and, by extension, Ingress and Gateway API. Services are namespaced, so it seems there can be Services that have Pods in different networks - how is the EndpointSlice controller going to handle these Services?

  • If there can be overlapping IPs, and some of the Pods in the Service belong to different networks with overlapping IPs, how will this be handled?
  • Where is the ClusterIP going to be present, in all networks or only in one? Which one?
  • For headless Services, how will DNS be used in each network? What will the resolv.conf of the Pods look like, especially for multi-homed Pods?
  • Webhooks use Services and have to be reachable from the apiserver; if overlapping IPs are allowed, how can you guarantee that the apiserver is able to get to all the Pods in all networks? Must it be modified to be network-aware?
  • Same for the kubernetes.default Service: it has to be reachable from all networks; how will the endpoint reconciler in the apiserver choose the endpoints for all networks?

Contributor Author

This KEP does not introduce support for Services. Based on @danwinship's feedback, we cannot reuse the PodIPs field in status anyway and have to add a new field, which will not affect this controller; I have removed this section.

that only IPs for "defautl" PodNetwork will be used.

### Kubelet changes
We will introduce an additional check for kubelet readiness for networking.
Member

You also need to handle the resolv.conf and all the DNS configuration in the kubelet.

@mskrocki (Contributor Author) Mar 8, 2024

Is this the resolv.conf of the node, or the one the Pods get? I assume the Node's is handled by the platform, and DNS for Pods is not in scope for this Phase.

Lastly, based on @danwinship's feedback above, I realized that having the scheduler guard Pods from being sent to a Node before the required PodNetwork is present covers this functionality; I removed this section.


### Scheduler changes
These are the changes we will do in Pod scheduler:
* Provide active Pod spec validation
Member

Can you expand on what those active Pod spec validations are?

@mskrocki (Contributor Author) Mar 8, 2024

Described above; will provide a link in the text.

These are the changes we will do in Pod scheduler:
* Provide active Pod spec validation

When one of the multi-network validation fails, scheduler will follow the current
Member

This will require SIG Scheduling approval; please also provide more detail, especially about the possible implications for performance and autoscaling.
cc: @alculquicondor

Member

Where does this validation happen?

Are you adding any new scheduling plugins?

Contributor Author

TBH I have not investigated this deeper. This is just a description of what I would like to achieve on the scheduler side. If I can just do this with a plugin, then yes. Will follow up.

#### Pod Creation
The Pod creation is handled by the SyncPod function, which calls RunPodSandbox
([code](https://github.com/kubernetes/cri-api/blob/release-1.28/pkg/apis/runtime/v1/api.proto#L40))
CRI API. The parameters for that function are defined by PodSandboxConfig ([code](https://github.com/kubernetes/cri-api/blob/release-1.28/pkg/apis/runtime/v1/api.pb.go#L1343)).
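For readers of this thread, a very rough sketch of the kind of CRI-side addition being debated (this is not the KEP's proposal; names and shape are assumptions, and the actual change is deferred to a separate KEP):

```go
package example

// PodNetworkConfig sketches the per-network information kubelet could resolve
// and hand to the runtime when creating the sandbox.
type PodNetworkConfig struct {
	Name          string // PodNetwork the sandbox should be attached to
	InterfaceName string // requested interface name inside the sandbox, if any
}

// PodSandboxNetworks sketches what a PodSandboxConfig-style message might gain,
// expressed as a Go struct purely for illustration (the CRI itself is protobuf).
type PodSandboxNetworks struct {
	Networks []PodNetworkConfig
}
```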
Member

You need approval from SIG Node (@SergeyKanzhelev), and from the container runtimes, at least containerd and CRI-O.

@aojea (Member) commented Feb 25, 2024

My summary:

  • The user stories do not clearly define the user problems; we should understand the user problems better, as some of them may be solved in a more Kubernetes-native way instead of having to replicate all the virtual-network complexity in Kubernetes itself.

For example, one user story I heard for using Pods with an additional external interface was to implement BGP against an upstream router for fast failover; when I asked why they would not use probes, they said it was because probes were too slow, but they would prefer probes if they could operate in sub-seconds, so that can be solved with #3066.

Another common request that we never solved is "Support port ranges or whole IPs in services" (#23864), to be able to assign IPs to Pods; today people have to add an external interface to the Pod for that.

  • The list of requirements needs a clearer justification; it is not easy to see the relation to the user stories.
  • The new network objects seem to implement some kind of logical partitioning at the network level; however, Kubernetes uses namespaces as the logical partitioning (https://kubernetes.io/blog/2016/08/kubernetes-namespaces-use-cases-insights/). This may imply we'll have a logical partition and a physical partition via the network, but the physical partition will not be easy for users to observe, as it seems possible to have Pods with multiple networks in the same namespace. This does not look like nice UX and also opens a lot of doubts about the feasibility of this change.
  • I am missing a lot of details about the overall behavior of the cluster: NetworkPolicies, Services, webhooks, the kubernetes.default Service, DNS, kubectl exec, port-forward. The KEP has to define what the behavior of these core Kubernetes functionalities will be.
  • There are external dependencies. SIG Network can drive the KEP, but we cannot impose changes on other SIGs; there are at least two SIGs, Node and Scheduling, that will need to approve these changes, and I expect API Machinery, Scalability and Architecture to get involved, as at least webhooks and kubernetes.default are going to be impacted. It also seems we'll need the implementations in the container runtimes; we cannot merge API objects and wait several releases for the runtimes to implement the changes.

@danwinship (Contributor)

  • The new network objects seem to implement some kind of logical partitioning at the network level; however, Kubernetes uses namespaces as the logical partitioning

Namespaces are not the only unit of logical partitioning in Kubernetes. For example, Pods are also partitioned by Node (Local traffic policy) and by zone (topology-aware routing).

@aojea (Member) commented Feb 25, 2024

Pods are also partitioned by Node (Local traffic policy) and by zone (topology-aware routing).

At the Service level, not at the Pod level.


N/A

## Alternatives
Member

Some references to projects that implement a network-oriented architecture in Kubernetes and/or with containers. We should also understand the lessons learned from them and, if these projects provide the same functionality, why they are not being used today. The Alternatives section should consider them:

https://github.com/openshift/kuryr-kubernetes
https://docs.docker.com/network/
https://kubevirt.io/user-guide/virtual_machines/interfaces_and_networks/
https://events19.linuxfoundation.cn/wp-content/uploads/2017/11/Multiple-Networks-and-Isolation-in-Kubernetes-_Michael-Xie.pdf

Contributor Author

Those are specific implementations; I'm not sure how these are alternatives? Per @danwinship's feedback I moved the DRA discussion here.

Member

If those specific implementations already solve this problem and are available and working, especially the KubeVirt one, we need to understand why we discard them. ... DRA is a piece of what you are proposing; you are proposing a full multi-network architecture to support Kubernetes as an IaaS ... KubeVirt solves these problems and already works; we need to understand why KubeVirt is not a valid option and why we need to do this in Kubernetes.

Contributor Author

I think you are missing the main point of this KEP: we do not propose a solution for the MN implementation but a common API that all these implementations you listed could use. All of these have their own CRDs and ways to reference them in the Pod; this KEP is trying to standardize the reference aspect ONLY.

* For specific PodNetwork a Pod can either reference that PodNetwork or
PodNetworkAttachment (referencing that PodNetwork), but not both at the same time
* PodNetwork can be referenced by multiple Pods
* PodNetworkAttachment can be referenced by multiple Pods
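A minimal sketch of how such mutually exclusive references might be expressed on the Pod (field names are assumptions for illustration, not the KEP's final API):

```go
package example

import "fmt"

// NetworkRef sketches one entry of a Pod's networks list: it points either
// directly at a PodNetwork or at a PodNetworkAttachment (which itself
// references a PodNetwork), but never both.
type NetworkRef struct {
	PodNetworkName           string `json:"podNetworkName,omitempty"`
	PodNetworkAttachmentName string `json:"podNetworkAttachmentName,omitempty"`
}

// validateNetworkRef enforces the "one or the other, not both" rule quoted above.
func validateNetworkRef(r NetworkRef) error {
	if (r.PodNetworkName == "") == (r.PodNetworkAttachmentName == "") {
		return fmt.Errorf("exactly one of podNetworkName or podNetworkAttachmentName must be set")
	}
	return nil
}
```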
@l1b0k Mar 5, 2024

In multi-tenancy situations, per-Pod configuration is usually different, so the PodNetworkAttachment data may be 1:1 with the number of Pods.

This will cause a lot of overhead.

If PodNetworkAttachment is a per-Pod configuration, there is no need for it to be referenced by multiple Pods.

If the network definitions are different, multiple PodNetworks are used to plan the network plane.

Contributor Author

This is to provide flexibility to the implementations. One can support 1:1, another 1:n.

@aojea (Member) Mar 15, 2024

We cannot regress in scalability and performance. @l1b0k's question and comment is spot on; we need to understand the impact of this overhead. We cannot sacrifice performance for flexibility without a real understanding of how much penalty this flexibility will cause, and what its benefit is ... opening a door for multiple implementations to behave differently is not a good answer if it means users will get different behaviors in different places for the same cluster and APIs.
