
KEP-3698: Multi-Network #3700

Open · wants to merge 2 commits into base: master

Conversation

@mskrocki (Contributor) commented Dec 21, 2022

  • One-line PR description: Multi-Network

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 21, 2022
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/network Categorizes an issue or PR as relevant to SIG Network. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 21, 2022
a network
* Identify what “Object” is the Primary network
* Optional parameters: MAC address, IP address, speed, MTU, interface name etc.
14. Every Pod connected to specific network (represented by the “Object”) must

Is this a necessary limitation for all "Objects"? Perhaps a network is meant only for north-south traffic. I wonder if this could be left to be defined by the implementation, similarly to the next point.

Contributor Author

I think that would be too open. Today's K8s network has exactly this requirement; this way we standardize across all the networks.

Do you have a use case for such a situation?

the “Object”)
15. Pods attached to a network are connected to each other in a manner defined
by the “Object” implementation
16. Basic network Interface information for each attachment will be exposed to

Is there a comparable downward API today? If not, I wonder if this could be deferred to a later phase. I'm concerned about implications of hot-plug or dynamic IPAM on this. It would also make this first phase a little leaner.

Contributor Author

I don't think there is one. Fair point on maybe moving it to a later phase. What is your concern with regard to hot-plug or dynamic IPAM?

16. Basic network Interface information for each attachment will be exposed to
runtime Pod (via e.g. environment variables, downward API etc.)

#### Phase II (scheduler, kubelet and API probing)

Would QoS and resource handling fit here? Since this phase is concerned with the scheduler, maybe it should cover resource management too - if network were in the same category as CPU or memory, it would suggest using the same scheduling/pooling mechanism.

Contributor Author

I think QoS and "network-resourcing" as a whole would be a new requirement. I would consider this a future extension.


Shouldn't this be "implementation specific"? aka something your CNI plugin(s) deal with?

@maiqueb left a comment

Thanks for this proposal @mskrocki .

during cluster creation that is available to the Pod when no additional
networking configuration is provided.
* **Primary network** - This is the network inside the Pod which interface is
used for the default gateway.
Contributor

In the meetings I've been to, people keep getting confused about these two. It certainly doesn't help that the definition of "primary network" pretty much has to use the word "default".

You could call the first one the "Cluster-wide network"? Or "Traditional Kubernetes network"?

Contributor Author

The additional networks can be "cluster-wide" and/or "Traditional Kubernetes networks". Naming is always tricky, do you have other recommendations?


I personally think you cannot avoid the "default" term in both scenarios due to historical semantics of networking. The "default network" could be called the "default instance" or another term based on the naming of the object.
The "primary network" could be referred to as the "primary attachment", where the logic is to specify to which object the Pod's "primary" interface must be attached.

Comment on lines 257 to 258
18. Kubelet network-based probing is optional for the “Object” connections to
Pod
Contributor

Optional for who?

I think what you mean is "if a Pod is not attached to the default cluster-wide network, then it is not defined whether network-based probes to that pod will work or not".

Contributor Author

refactored.

Comment on lines 259 to 261
19. “Object” connections to Pod are optionally able to connect to Kubernetes
API - the Pods connections via non-default Pod network does not require access
to Kubernetes API
Contributor

What are '"Object" connections to Pod'? I'm imagining an object in etcd calling connect() somehow...

I guess you mean "Pods are not guaranteed to be able to reach the Kubernetes API via networks other than the cluster-default network" ?

Contributor Author

refactored.

Comment on lines 262 to 264
20. Kubernetes API can optionally reach to “Object” connections to
Pod - Kubernetes API access to the Pod via non-default Pod network is not
required
Contributor

"The Kubernetes apiserver is not guaranteed to be able to reach pods via networks other than the cluster-default network"

Contributor Author

refactored.

20. Kubernetes API can optionally reach to “Object” connections to
Pod - Kubernetes API access to the Pod via non-default Pod network is not
required

Contributor

Phase II doesn't make sense as a separate phase. Each of the requirements 17-20 says "you don't have to do X". If Phase II is something above and beyond Phase I, then does that imply that in Phase I, you do have to do all those things? Eg, in Phase I, all networks must be available on all nodes, network probes have to work on all networks, etc, and then in Phase II we make it all optional?

I think Phase II needs to just be squashed into Phase I.

Unless everything in Phase II is kind of the opposite of what it says, and you really mean "in Phase I, network probes definitely can't be used on non-default networks but then in Phase II we will define an optional way to make it work". In which case, that needs to be more clear.

Contributor Author

It is the latter, where we would add support for such capabilities. I have reworded the requirements to better reflect this.

Comment on lines 267 to 268
21. “Object” connections to Pod are optionally able to provide Service,
NetworkPolicies functionality
Contributor

"Connections to Pods over secondary networks" or whatever. '"Object" connections to Pod' is just weird.

Contributor Author

refactored.

Comment on lines 238 to 239
12. The Pod reference to a Networks is optional and when NOT specified, Pod
connects to “Default” “Object” (network the cluster has been created with)


Based on the discussion (#3700 (review)), a Pod may NOT connect to the "Default" network and may instead connect to a "namespace-default" Network.

This may change current behavior and violate the first requirement ("1. This effort shall not change the behavior of today's existing clusters"), because an existing cluster connects all Pods to the cluster network (which I believe is the "Default" network in your KEP).

So could you please refine them to be consistent? (Change requirement 12 and the "namespace-default" network, or just remove requirement 1.)

Contributor Author

How the override of the namespace-default network will work will be defined in the future. This requirement is about the default behavior, to explicitly satisfy requirement 1.

@squeed commented Jan 11, 2023

This all seems a bit too abstract. KEPs traditionally propose either some Kubernetes functionality or an API object. It reads like the charter to a working group rather than a KEP.

Is there a way this can be closer to a specific proposal? I'd like to see the API types :-).

@mccv1r0 commented Jan 11, 2023

When the dust settles we need to be able to have a k8s pod do the equivalent of what docker/podman can do today. (Apologies for those on the calls who already heard all this.)

For example, using podman, since it supports the same CNI as k8s, I can use:

sudo podman run --privileged --detach --name=exampleOne --network=public,dmz,storage,private quay.io/nginx:latest

This simple example puts 4 interfaces in the container. What I want to do with them doesn't need to be a concern of the apiserver. In each case I pick plugins, and the .conflist configures them to solve my problem. What I use and how they are configured are implementation specific.

As discussed, we will need a way to pass plugin-specific info (static IP, static MAC, etc.) to each network/.conflist. Most/all use cases I've seen in the doc can be demonstrated with docker/podman.

for performance purposes (high bandwidth, low latency), that my user-space
application (e.g. DPDK-based) can use. The VF will not use the standard netdev
kernel module. The Pod’s scheduling to the nodes should be based on hardware
availability (e.g. devicePlugin or some other way).


Usually, when you dig into this, what you discover is that you have a workload that requires some sort of special treatment from the physical network and a fast path to it.

The problem is that this requirement is entirely mis-focused. Literally nobody wants an SRIOV VF. What they want is a fast path to the physical network doing a specific set of things for them.

You can tell this because

  1. If you offer someone a VF that leads to the physical network mishandling the packets pushed into the VF, they will be very unhappy.
  2. If you can't offer them an SRIOV VF because your hardware only supports SIOV (the next technology coming after SRIOV), they will be very unhappy (even if you could have given them a vfio for the SIOV that is connected to the proper physical network treatment).
  3. If you offer them vfio for the VF when they needed a kernel interface, they will be unhappy.

All of this becomes even more crucial when you are scheduling the Pod. If I schedule the Pod to a Node that just has SRIOV VFs available, but its NICs lack the required capabilities, or cannot provide a fast path to the required sort of treatment, this all breaks down.

There are important sets of requirements in here... but I would recommend thinking them through outside of the 'interface' 'network' paradigm being applied here to get to the real root of the need.

I'd suggest rethinking the requirement in terms of:

  1. What Network Service does the user want from the physical network?
  2. How do they want to consume it? Via what mechanism? (kernel interface, rdma, vfio, etc)
  3. What capabilities do they need from the fast path connection to that Network Service?

#### Story #3
I have implemented my Kubernetes cluster networking using a virtual switch. In
this implementation I am capable of creating isolated Networks. I need a means
to express to which Network my workloads connect to.


@danwinship Agreed. This requirement feels like it would be covered by a properly written 'traffic segregation' requirement.

As a Virtual Machine -based compute platform provider that I run on top of
Kubernetes and Kubevirt I require multi-tenancy. The isolation has to be
achieved on Layer-2 for security reasons.


Why does this have to be achieved at L2 for security reasons? That's an oddly specific requirement. What security reasons?

This is particularly worrisome as K8s carefully avoids having any L2 concepts in it. It would be unfortunate to re-introduce them for such sparse requirements.

Contributor Author

This use case is specific on purpose (to counter some of the above comments about use cases being too vague). This is NOT about adding L2/L3 notions into k8s; it's about handling such a use case. Check this section's top description: any technology or product mentioned here does NOT mean it will be supported, and this is reflected in the requirements below, none of which talks about supporting a specific layer or product.

As a platform operator I need to connect my on-premise networks to my workload
Pods. I need to have the ability to represent these networks in my Kubernetes
cluster in such a way that I can easily use them in my workloads.


Could we get more clarity on this case? There are lots of ways of achieving this already using BGP in many existing CNI plugins. Is this perhaps another traffic segregation case?

@dabernie Jan 26, 2023

Challenge: most existing CNI plugins do it via specific CRDs which do not necessarily coexist well. The purpose is to offer a solution where K8s could support such implementations without specific CRDs.

That way, the challenge of supporting workloads and avoiding the "platform certification" mess should become simpler.

“profiles” for Namespace, where I can not only change default Network,
but define a set of Networks assigned to given Namespace that Pods created in
that NS are automatically attached to.


OK... so there's a desire here to isolate workloads (traffic segregation), this time by namespace. It sounds like in this case there's a desire to have a segregated primary network per namespace, but it's unclear to me from the requirement as written why this would imply a 'set of Networks'.

It feels like this may be smashing two requirements together:

  1. A desire to have per-namespace traffic segmentation for primary networks. If so, this opens interesting questions about how that interacts with the rest of K8s networking.
  2. A desire to add a set of networks by namespace to a Pod... but it's unclear why in this requirement.

Comment on lines 219 to 220
2. We need to introduce an “Object” that represents the existing
infrastructure’s networking


Question: Why? Why do we need to introduce an Object here? This feels more like the 'solution' side of things than the 'requirement' side of things.

As engineers, we all have a serious tendency to decide on a solution and then decide that solution is a requirement. We all do it (I more than most). This screams of that here. Let's get at the real requirements underpinning this proposed solution.

Comment on lines 223 to 224
4. The “Object” shall not define any implementation specific parameters in that
object


I am also confused.

Comment on lines 225 to 227
5. “Object” shall provide option to define:
* IPAM mode: external/internal
* List of route prefixes - optional and not forced on the implementations


In addition to @danwinship's points here (which I support)... how are we handling or preventing IPAM that conflicts with the Cluster/Service CIDRs, and routes that hijack Cluster/Service CIDRs?

Comment on lines 225 to 227
5. “Object” shall provide option to define:
* IPAM mode: external/internal
* List of route prefixes - optional and not forced on the implementations


One other point to make: by putting this 'Object' in the K8s API server, it is intrinsically moved out of the world of the people actually administering the network and into the world of the people administering the cluster. Generally speaking, in my experience this will work some of the time, break horribly some of the time, and work with a lot of pain for all involved some of the time, depending on the interactions between the K8s administrators and the folks administering the physical network involved.

Having the flexibility to move this point of control around is going to be crucial, and that argues against it necessarily being required to live in the K8s API server.

@shaneutt (Member)

/cc

for performance purposes (high bandwidth, low latency), that my user-space
application (e.g. DPDK-based) can use. The VF will not use the standard netdev
kernel module. The Pod’s scheduling to the nodes should be based on hardware
availability (e.g. devicePlugin or some other way).
Member

I agree with Dan that this is a particularly compelling user story. As for SRIOV VF I get that this was used as an example and so focusing in on it specifically might be distracting, but I also think it might be worth taking a couple seconds to add a couple more examples just to moot the point.

Comment on lines 204 to 205
2. We need an API object representing network, which a Pod, a Service or any
other feature that wishes to support Multi-Network capability could reference
@s1061123 May 9, 2023

I am still not quite clear why we need this object, as @edwarnicke adds in his comment.

Could you please elaborate to clarify:

  • What does the 'object' achieve for Kubernetes users?
  • Why is the 'object' required for users?
  • Without the 'object', what is not achieved?

Based on the current Kubernetes community situation, Multus, Network Service Mesh and other network components outside of the Kubernetes core do achieve multi-network without core network objects.

Contributor Author

Today's implementations create CRDs that have to be referenced via annotations or other indirect means. An annotation is not meant to define the configuration of an object; additionally, there is no validation of the user input.
On top of that, every implementation has its own CRDs and way of configuration, so there is no single standard.


Today's implementation are creating CRDs, that have to be referenced via annotations or other in-direct means.

Some network implementations use CRDs, but not all. For example, flannel does not use CRDs. In addition, each implementation has several CRDs to specify a network, but that depends on the implementation. Some implementations use a CRD for 'network', but some use CRDs for 'network IPAM' and 'network transport (i.e. SDN configuration)'.

Your design assumes that one network object is required for everyone, for each implementation, but I guess it actually is not. So please clarify and give a reason why we need the object. Currently the reason seems ambiguous.

Contributor Author

The main reason is to have a handle/representation of the "network" a Pod connects/attaches to. This does NOT exist today. This is mentioned in the Summary and Motivation; maybe it is not sufficient. I will add more there.


### Goals

Define user stories and requirements for the Multi-Network effort in Kubernetes.
@shaneutt (Member) May 10, 2023

At today's multi-network sync (May 10th, 2023 - notes available here) there was a conversation involving several members of the community regarding the desire for this KEP to better define standards for defining additional parameters of a network interface for a Pod (perhaps particularly to help support DRA and QoS configurations) which may be implementation-specific, or perhaps generic enough for all implementations as an explicit goal of this KEP.

Personally I agree with the high level goal, but I think we should avoid trying to put new things in core if possible for a variety of long term maintenance reasons. There are other takes on this which I invite conversation to 🧵 here about.

Ultimately, I would like to see that conversation continued and resolved before we merge this KEP and the goals here adjusted accordingly.

cc @marquiz (re #3004)

Contributor

Thanks @shaneutt. I need to read this KEP with thought, but ultimately I think that DRA (if any) would be in line with the requirements. The QoS resources KEP (#3004) is something that could possibly be used to cover some current usage scenarios involving e.g. annotations. But it's good to understand what it is and how it's different.

cc @pohly


My thoughts are that there are two places where we might have arbitrary config.

  • At the network level. For example, parameters specifying what VLAN ID this network has, or what IP address ranges it is allowed to use.
  • At the network interface level. For example, parameters specifying the QoS for a particular interface, or the allowed bandwidth, or the required failure domain to allow redundancy.

These are all going to be implementation dependent, and we shouldn't put them into core Kubernetes. The easiest way to do this is to have a CR with whatever specific implementation specific horrors are required in it, and allow the standard K8s objects to have a reference to that CR (with it being entirely up to the implementation what to expect in the CR and what to do with it).

There is already a reference to a CR in the PodNetwork resource. My takeaway from the discussion was that the PodNetworkAttachment resource probably needs exactly the same capability.

I'm open to some common fields making their way into the PodNetworkAttachment (say MAC address or even IP as well as interface name or similar), but I can see that turning into something pretty ugly pretty fast if we aren't careful, so shoving as much as possible into a CR seems safer. I'm also open to this being "something we add in a later phase, but agree we expect to do later". Finally, I think the alternative is some kind of annotation model, where we optionally list interfaces and references to their config in annotations in the pod, which is plausible; a bit ugly but also less intrusive to core K8s.
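A minimal sketch of the CR-reference idea being discussed (type and field names here are assumptions for illustration, not the KEP's agreed API):

```go
package example

// ParametersRef is a hypothetical typed pointer to an implementation-owned
// custom resource, keeping implementation-specific settings out of core types.
type ParametersRef struct {
	Group     string `json:"group"`               // API group of the implementation's CRD
	Kind      string `json:"kind"`                // Kind of the implementation's CRD
	Name      string `json:"name"`                // name of the concrete parameters object
	Namespace string `json:"namespace,omitempty"` // optional, for namespaced parameters
}

// PodNetworkAttachmentSpec sketches how an attachment could carry only generic
// fields plus an opaque reference to the implementation's own configuration
// (QoS, bandwidth, failure domain, and similar).
type PodNetworkAttachmentSpec struct {
	PodNetworkName string         `json:"podNetworkName"`
	ParametersRef  *ParametersRef `json:"parametersRef,omitempty"`
}
```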

Member

As this is a requirements KEP, we should try to avoid blocking it on other KEPs unless we think it is fundamentally a non-starter for the community. The cost paid for work on refining requirements is community time rather than long-term tech debt.

* **Cluster Default PodNetwork** - This is the initial cluster-wide PodNetwork
provided during cluster creation that is available to the Pod when no
additional networking configuration is provided in Pod spec.
* **Primary PodNetwork** - This is the PodNetwork inside the Pod which interface
Member

We should be specific about the cardinality relationship between PodNetwork and interfaces. Is it always 1-1? If not, we will need to be very careful whenever one is mentioned vs another.


I assume we will probably end up with either a scope field in the API or different APIs (Kinds) for each of these use cases? Wondering if the following nomenclature pattern is more intuitive.

  1. Cluster Default PodNetwork
  2. Namespace Default PodNetwork (For future expansion, note Default is not the name of the namespace, this is the default network associated with a namespace)
  3. Default PodNetwork (instead of primary network, this is the one assigned to the pod, either by policy or by the user picking one).

Contributor Author

@bowei The cardinality relationship is going to be defined in Phase I. I don't want to mention that in this phase.

@pvenu-Google using "Default PodNetwork" when we already have "Cluster Default PodNetwork" will not be distinguishable; most folks will keep skipping the "Cluster" word, which will lead to confusion. And we want to keep the "Default" PodNetwork as it is now, to align with the "default" namespace.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mskrocki
Once this PR has been reviewed and has the lgtm label, please assign thockin for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Comment on lines +249 to +250
* We will introduce an PodNetwork API object that represents the existing
infrastructure’s networking.
Member

How do you solve the bootstrap problem? Nodes are created and then Pods; when is the PodNetwork object created, and by whom?
What is the lifecycle of the PodNetwork object?
How does it interact with the lifecycle of the Node and Pod objects?
How is the infrastructure represented?
How does it handle network partitions in the infrastructure? What happens if Node A can talk with Node B but not vice versa?

Contributor Author

Most of these are answered below. Regarding

How does it handle network partitions in the infrastructure? What happens if Node A can talk with Node B but not vice versa?

This KEP will not define/explain any infrastructure relations. The PodNetwork is just an abstraction for that, and it is up to the implementation how strongly it is coupled with the infra.

@uablrek commented Feb 17, 2024

Why all the use-cases?

As I see it, all of them are implementation specific. For the API extension, which the KEP is about, it would be sufficient to have a use-case #0, like:

As an advanced network user I would like to add various network interfaces to various PODs for various reasons

@uablrek commented Feb 17, 2024

I'm missing a detailed description of the clean-up of all PodNetwork attachments in a Pod when it dies. In my experience it is always much, much harder to delete things than to add them. As of now the CRI plugin will call the CNI plugin with a DEL command. With Multus, it is Multus's responsibility to do a DEL on all its sub-CNI-plugins.

But who will do this in K8s multi-networking? Kubelet? If so, will KNI become a prerequisite?

Will this part be skipped? If so, it will be a requirement on implementations to garbage-collect resources such as IP addresses.

@danwinship (Contributor) left a comment

Yeah, I think the big picture here is correct.

interfaces
* applications leveraging performance-oriented interfaces (e.g. `AF_XDP`,
`memif`, `SR-IOV`)
* applications requiring support for protocols not yet supported by Kubernetes
Contributor

I'm not sure what this means... If you only care about supporting the protocol on the primary network interface then isn't it just a matter of selecting a primary network plugin that supports that protocol? (I'm not sure what level of "protocol" you're referring to, but SCTP and multicast are examples here: k8s conformance doesn't require either of those, and many plugins don't allow them, but some do.)

Contributor Author

I do mean networking protocols, e.g. multicast as you mentioned. Agreed that you could change the primary CNI and support any protocol, but what if I don't want to touch my default Network configuration but still add support for the new protocol? With MN I can just create a new PodNetwork that uses the additional/new CNI, and only for those Pods that need it.

Comment on lines +111 to +113
* Introduce API object to kubernetes describing networks Pod can attach to.
* Evolve current Kubernetes networking model to support multiple networks,
defining the new model in a backwards compatible way.
Contributor

I think there's increasing consensus that we ought to have a better network model, and objects that describe networks, even in the single-network case (eg "KNI").

I'm not sure what the right way to talk about this in the KEP is, since KNI is even more pre-alpha than Multi-Network. But I think it's important to be thinking of this as not just about multiple networks, but also about having a better (and more k8s-native) model of pod networking in general, which in particular supports multiple networks.


### Non-Goals

Define the CNI implementation.
Contributor

Not exactly sure what you mean here (even after finishing reading the KEP).

Contributor Author

I meant to say here that this KEP does not specify how to implement the CNI, but just provides an API abstraction/handle.

for performance purposes (high bandwidth, low latency), that my user-space
application (e.g. DPDK-based) can use. The VF will not use the standard netdev
kernel module. The Pod’s scheduling to the nodes should be based on hardware
availability (e.g. devicePlugin or some other way). This interface might not
Contributor

Not clear if you mean "k8s will implicitly use a devicePlugin or something to handle scheduling" or "the user will need to explicitly think about devicePlugins to get scheduling to work right".

(The latter is how it works in Multus; when using SR-IOV, it's not enough to just request the SR-IOV NetworkAttachmentDefinition, you have to also request an SR-IOV resource. But this is annoying and makes SR-IOV NADs work differently from other kinds of NADs (eg, ipvlan-based). Of course, Multus can't make scheduling happen automatically because the scheduler doesn't know anything about Multus, but a k8s-native API ought to be able to do better here.)

Member

This exact functionality is described in DRA (https://docs.google.com/document/d/1XNkTobkyz-MyXhidhTp5RfbMsM-uRCWDoflUMqNcYTk/edit?disco=AAAA9O0je4I). I don't think we should duplicate functionality; we need to understand what is missing in DRA (which is being revisited these days) and suggest the necessary changes if we can reuse that mechanism for this.

// Examples: eth1 or net1
//
// +optional
InterfaceName string `json:"interfaceName,omitempty"`
Contributor

Why does it make sense to specify this here rather than as a ParametersRef? Particularly given that it's not even meaningful for some network types (eg, the bandwidth user story).

Contributor Author

This is to ensure the name is unique across the list of Networks specified. Otherwise there is no control, especially if I attach 2 PodNetworks from 2 different vendors.
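For illustration, a minimal sketch of the uniqueness check this field enables (type and field names are assumptions, not the KEP's final API):

```go
package validation

import "fmt"

// PodNetworkRef is a stand-in for a Pod's per-network entry; the real KEP type
// and field names may differ.
type PodNetworkRef struct {
	PodNetworkName string
	InterfaceName  string
}

// validateInterfaceNames rejects Pod specs where two attachments request the
// same in-Pod interface name, e.g. two vendors both asking for "net1".
func validateInterfaceNames(refs []PodNetworkRef) error {
	seen := map[string]bool{}
	for _, r := range refs {
		if r.InterfaceName == "" {
			continue // unnamed entries get an implementation-chosen name
		}
		if seen[r.InterfaceName] {
			return fmt.Errorf("duplicate interface name %q", r.InterfaceName)
		}
		seen[r.InterfaceName] = true
	}
	return nil
}
```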

// for it.
//
// +optional
IsDefaultGW4 bool `json:"isDefaultGW4,omitempty"`
Contributor

Likewise, why does it make sense to specify this here rather than as a ParametersRef? It seems unlikely that for a given PodNetwork, some pods would want to have it as their default route and others would not. (Though of course, if that were the case, they could just use different PodNetworkAttachments.)

in particular, having separate IPv4 and IPv6 default routes is a really niche use case

(as with IPAM, I feel like the eventual solution here is a standard parameter type)

Contributor Author

Same as with interfaceName: how do I coordinate 2 PodNetworks that both indicate IsDefault?

Agreed on the v4/v6 split; let me remove that, and if need be we can make it more complex in the future.
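The same kind of check could address the coordination question above; a sketch with assumed type and field names, not the KEP's final API:

```go
package validation

import "fmt"

// GatewayRef is a stand-in for a Pod's per-network entry; real names may differ.
type GatewayRef struct {
	PodNetworkName   string
	IsDefaultGateway bool
}

// validateSingleDefaultGateway ensures at most one attachment on a Pod asks to
// carry the default route, so two PodNetworks cannot silently conflict.
func validateSingleDefaultGateway(refs []GatewayRef) error {
	chosen := ""
	for _, r := range refs {
		if !r.IsDefaultGateway {
			continue
		}
		if chosen != "" {
			return fmt.Errorf("both %q and %q request the default gateway", chosen, r.PodNetworkName)
		}
		chosen = r.PodNetworkName
	}
	return nil
}
```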

Comment on lines +519 to +524
We have considered using classes (similar to GatewayClass etc.), but we do not
expect any extra action for a PodNetwork to take place for specific implementation.
PodNetwork already points to a custom resource, which implementation can hook on
for any specific extra configuration. At this point of time we consider using
classes as overkill, and in future if such need arises, we can reuse the provider
field for that purpose.
Contributor

So, I think your model of how this works is that there's a single component in the cluster responsible for all PodNetworks.

But another model would be to allow multiple PodNetwork-managing components in the cluster, each supporting different functionality, and managing only the PodNetworks of their own PodNetworkClass. For example, there might be separate PodNetworkClasses implementing each of the User Stories (though there could also be "heavyweight" multi-use-case PodNetworkClasses like the implied universal PodNetworkClass currently envisioned by the KEP).

I think the PodNetworkClass model would lead to more reused code and more common functionality across cluster types. Given the model of the current KEP, it's possible (indeed, likely) that, say, GKE and OpenShift could both provide ways of implementing all of the User Stories, but those ways would be entirely incompatible, involving different Provider names and ParametersRefs of completely different CRD types. Whereas with the PodNetworkClass model, GKE and OpenShift might each have some PodNetworkClasses of their own, but could also make use of "generic" ones (possibly ones written by users themselves).
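A minimal sketch of the class indirection being suggested, modeled loosely on GatewayClass (all names here are assumptions, not part of the KEP):

```go
package example

// PodNetworkClass is a hypothetical class object naming the controller
// responsible for PodNetworks of that class, so several managing components
// can coexist in one cluster.
type PodNetworkClass struct {
	Name string
	// ControllerName identifies the component that reconciles PodNetworks
	// referencing this class, analogous to GatewayClass.spec.controllerName.
	ControllerName string
}

// PodNetworkSpec would then select its manager by class name instead of (or in
// addition to) a free-form provider string.
type PodNetworkSpec struct {
	ClassName string // which PodNetworkClass, and therefore which controller, owns this network
}
```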

Contributor Author

Multiple implementations in a cluster are supposed to be handled by the provider field, assuming the implementations support that field, and this makes interoperability dependent on the implementations themselves.

With a class this is enforced if we make that field mandatory. Will bring this up in the next sync meeting and talk it through.

#### Story #8
As Cluster operator I wish to manage my cluster’s bandwidth usage on a per-pod
basis, simultaneously preserving all other Pod networking configuration the same
across those Pods.
Contributor

The more I think about this, the less it seems to fit with this KEP.

Or at least, it doesn't fit as a PodNetwork. This seems more like a PodNetworkAttachment that points to the default PodNetwork but also includes a ParametersRef pointing to some well-known parameter type containing bandwidth information...

(And maybe that's actually what you meant? I guess it's not clear from the story itself...)

Member

@danwinship you have references to PodNetworkAttachment in different places, and if by that you mean the representation on a Pod of a network interface (https://docs.kernel.org/networking/netdevices.html), then this is the part I agree 100% we need to solve first.

And this is what DRA is actually solving (#3063); it seems the API is still being revisited, and based on my current reading it can perfectly accommodate the use cases of PodNetworkAttachment.

Once we have these capabilities we can use them to build the other high-level abstractions required in this KEP.

rule of: single Pod references a given PodNetwork only 1 time

#### Auto-population
When networks field is not set, and hostNetwork is not set, we will
Member

What happens with pods with hostNetwork: true? To which network do they belong?

Contributor Author

None, MN does not apply to HN pods.

Member

So how does this work then? If I have 3 pods, podA in networkA, podB in networkB and podC in the host network, which Pods are able to communicate with podC?

Contributor Author

Because today there is only 1 network, the root netns and the Pod netns are treated as one. We still need to assume that, and we call this the "default" Network; pods connected to that Network have to be backward compatible. Any other network's behavior is up to the implementation. For some, isolation is a factor; for some it is not. We cannot enforce that.

The above status is expected to be populated by kubelet, but this can only happen
after CRI provides support for the new Pod API. Because of that, initially
kubelet will behave as it does today, without updating the additional fields.
Until CRI catches up, the PodNetwork providers will be able to update that field
Member

We do not merge split functionality; CRI API changes have to be done at the same time.

Contributor Author

CRI changes have to come with container runtime changes. Considering all the functions of CRI take a Pod as an argument, the required changes are there from my point of view.

Member

Considering all the functions of CRI take a Pod as an argument, the required changes are there from my point of view.

Making changes to the Pod API does not mean runtimes will adopt these changes.

Contributor Author

And that is fine; I have an agent-based CNI that does not care about them.

Comment on lines +970 to +972
Considering all above changes are in the direct scope of CRI, this KEP will not
propose complete changes for them, and a separate KEP will be created to cover it.
Below are some suggestions on what the changes could look like.
@aojea (Member) Feb 25, 2024

If those changes are required for this KEP they must be in this KEP, or are you suggesting we merge some changes first and the others later?
Are the merged changes going to work without the other changes? We could end up merging something that is not functional.

Contributor Author

Yes, I would see us introducing this API as a first step. Then container runtimes catch up and adjust to the new fields Pod has. Otherwise the change just grows too large. Lastly, I can solve this today with an agent-based CNI that does not rely on CRI, until they catch up.

Member

Yes, I would see us introducing this API as a first step. Then container runtimes catch up and adjust to the new fields Pod has.

I think you need to provide more details on this; how long do you estimate it will take container runtimes to catch up on these features?

Contributor Author

IDK; considering the KNI proposal is on the table as well, maybe then it will be independent. My point is, I do not want to create a dependency on container runtimes.

Comment on lines 949 to 952
### Endpointslice controller changes
Considering changes to Pod.PodStatus.PodIPs list, we must ensure that the
controller is using the correct IPs when creating the endpoints. We will ensure
that only IPs for "defautl" PodNetwork will be used.
@aojea (Member) Feb 25, 2024

This requires a much deeper explanation, as this is the core functionality of Services and, by extension, Ingress and Gateway API. Services are namespaced, so it seems there can be Services that have Pods in different networks - how is the EndpointSlice controller going to handle these Services?

  • If there can be overlapping IPs, and some of the Pods in the Service belong to different networks with overlapping IPs, how will this be handled?
  • Where is the ClusterIP going to be present, in all networks or only in one? Which one?
  • For headless Services, how will DNS be used in each network? What will the resolv.conf of the Pods look like, especially for multi-homed Pods?
  • Webhooks use Services and have to be reachable from the apiserver; if overlapping IPs are allowed, how can you guarantee that the apiserver is able to get to all the Pods in all networks? Must it be modified to be network-aware?
  • Same for the kubernetes.default Service: it has to be reachable from all networks; how will the endpoint reconciler in the apiserver choose the endpoints for all networks?

Contributor Author

This KEP does not introduce support for Services. Based on @danwinship's feedback, we cannot reuse the PodIPs field in status anyway and have to add a new field, which will not affect this controller; I have removed this section.

that only IPs for "defautl" PodNetwork will be used.

### Kubelet changes
We will introduce an additional check for kubelet readiness for networking.
Member

You also need to handle the resolv.conf and all the DNS configuration in the kubelet.

@mskrocki (Contributor Author) Mar 8, 2024

Is this the resolv.conf of the node, or the one the Pods get? I assume the Node's is handled by the platform, and DNS for Pods is not in scope for this Phase.

Lastly, based on @danwinship's feedback above, I realized that having the scheduler guard Pods from being sent to a Node before the required PodNetwork is present covers this functionality; I removed this section.


### Scheduler changes
These are the changes we will do in Pod scheduler:
* Provide active Pod spec validation
Member

Can you expand on what those active Pod spec validations are?

@mskrocki (Contributor Author) Mar 8, 2024

Described above; will provide a link in the text.

These are the changes we will do in Pod scheduler:
* Provide active Pod spec validation

When one of the multi-network validation fails, scheduler will follow the current
Member

This will require SIG Scheduling approval; please also provide more detail, especially about the possible implications for performance and autoscaling.
cc: @alculquicondor

Member

Where does this validation happen?

Are you adding any new scheduling plugins?

Contributor Author

TBH I have not investigated this deeper. This is just a description of what I would like to achieve on the scheduler side. If I can just do this with a plugin, then yes. Will follow up.

#### Pod Creation
The Pod creation is handled by the SyncPod function, which calls RunPodSandbox
([code](https://github.com/kubernetes/cri-api/blob/release-1.28/pkg/apis/runtime/v1/api.proto#L40))
CRI API. The parameters for that function are defined by PodSandboxConfig ([code](https://github.com/kubernetes/cri-api/blob/release-1.28/pkg/apis/runtime/v1/api.pb.go#L1343)).
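For readers of this thread, a very rough sketch of the kind of CRI-side addition being debated (this is not the KEP's proposal; names and shape are assumptions, and the actual change is deferred to a separate KEP):

```go
package example

// PodNetworkConfig sketches the per-network information kubelet could resolve
// and hand to the runtime when creating the sandbox.
type PodNetworkConfig struct {
	Name          string // PodNetwork the sandbox should be attached to
	InterfaceName string // requested interface name inside the sandbox, if any
}

// PodSandboxNetworks sketches what a PodSandboxConfig-style message might gain,
// expressed as a Go struct purely for illustration (the CRI itself is protobuf).
type PodSandboxNetworks struct {
	Networks []PodNetworkConfig
}
```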
Member

You need approval from SIG Node (@SergeyKanzhelev), and from the container runtimes, at least containerd and CRI-O.

@aojea (Member) commented Feb 25, 2024

My summary:

  • The user stories do not clearly define the user problems; we should understand the user problems better, as some of them may be solved in a more Kubernetes-native way instead of having to replicate all the virtual-network complexity in Kubernetes itself.

For example, one user story I heard for using Pods with an additional external interface was to implement BGP against an upstream router for fast failover; when I asked why they would not use probes, they said it was because probes were too slow, but they would prefer probes if they could operate in sub-seconds, so that can be solved with #3066.

Another common request that we never solved is "Support port ranges or whole IPs in services" (#23864), to be able to assign IPs to Pods; today people have to add an external interface to the Pod for that.

  • The list of requirements needs a clearer justification; it is not easy to see the relation to the user stories.
  • The new network objects seem to implement some kind of logical partitioning at the network level; however, Kubernetes uses namespaces as the logical partitioning (https://kubernetes.io/blog/2016/08/kubernetes-namespaces-use-cases-insights/). This may imply we'll have a logical partition and a physical partition via the network, but the physical partition will not be easy for users to observe, as it seems possible to have Pods with multiple networks in the same namespace. This does not look like nice UX and also opens a lot of doubts about the feasibility of this change.
  • I am missing a lot of details about the overall behavior of the cluster: NetworkPolicies, Services, webhooks, the kubernetes.default Service, DNS, kubectl exec, port-forward. The KEP has to define what the behavior of these core Kubernetes functionalities will be.
  • There are external dependencies. SIG Network can drive the KEP, but we cannot impose changes on other SIGs; there are at least two SIGs, Node and Scheduling, that will need to approve these changes, and I expect API Machinery, Scalability and Architecture to get involved, as at least webhooks and kubernetes.default are going to be impacted. It also seems we'll need the implementations in the container runtimes; we cannot merge API objects and wait several releases for the runtimes to implement the changes.

@danwinship (Contributor)

  • The new network objects seem to implement some kind of logical partitioning at the network level; however, Kubernetes uses namespaces as the logical partitioning

Namespaces are not the only unit of logical partitioning in Kubernetes. For example, Pods are also partitioned by Node (Local traffic policy) and by zone (topology-aware routing).

@aojea (Member) commented Feb 25, 2024

Pods are also partitioned by Node (Local traffic policy) and by zone (topology-aware routing).

At the Service level, not at the Pod level.


N/A

## Alternatives
Member

Some references to projects that implement a network-oriented architecture in Kubernetes and/or with containers. We should also understand the lessons learned from them and, if these projects provide the same functionality, why they are not being used today. The Alternatives section should consider them:

https://github.com/openshift/kuryr-kubernetes
https://docs.docker.com/network/
https://kubevirt.io/user-guide/virtual_machines/interfaces_and_networks/
https://events19.linuxfoundation.cn/wp-content/uploads/2017/11/Multiple-Networks-and-Isolation-in-Kubernetes-_Michael-Xie.pdf

Contributor Author

Those are specific implementations; I'm not sure how these are alternatives? Per @danwinship's feedback I moved the DRA discussion here.

Member

If those specific implementations already solve this problem and are available and working, especially the KubeVirt one, we need to understand why we discard them. ... DRA is a piece of what you are proposing; you are proposing a full multi-network architecture to support Kubernetes as an IaaS ... KubeVirt solves these problems and already works; we need to understand why KubeVirt is not a valid option and why we need to do this in Kubernetes.

Contributor Author

I think you are missing the main point of this KEP: we do not propose a solution for the MN implementation but a common API that all these implementations you listed could use. All of these have their own CRDs and ways to reference them in the Pod; this KEP is trying to standardize the reference aspect ONLY.

* For specific PodNetwork a Pod can either reference that PodNetwork or
PodNetworkAttachment (referencing that PodNetwork), but not both at the same time
* PodNetwork can be referenced by multiple Pods
* PodNetworkAttachment can be referenced by multiple Pods
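A minimal sketch of how such mutually exclusive references might be expressed on the Pod (field names are assumptions for illustration, not the KEP's final API):

```go
package example

import "fmt"

// NetworkRef sketches one entry of a Pod's networks list: it points either
// directly at a PodNetwork or at a PodNetworkAttachment (which itself
// references a PodNetwork), but never both.
type NetworkRef struct {
	PodNetworkName           string `json:"podNetworkName,omitempty"`
	PodNetworkAttachmentName string `json:"podNetworkAttachmentName,omitempty"`
}

// validateNetworkRef enforces the "one or the other, not both" rule quoted above.
func validateNetworkRef(r NetworkRef) error {
	if (r.PodNetworkName == "") == (r.PodNetworkAttachmentName == "") {
		return fmt.Errorf("exactly one of podNetworkName or podNetworkAttachmentName must be set")
	}
	return nil
}
```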
@l1b0k Mar 5, 2024

In multi-tenancy situations, per-Pod configuration is usually different, so the PodNetworkAttachment data may be 1:1 with the number of Pods.

This will cause a lot of overhead.

If PodNetworkAttachment is a per-Pod configuration, there is no need for it to be referenced by multiple Pods.

If the network definitions are different, multiple PodNetworks are used to plan the network plane.

Contributor Author

This is to provide flexibility to the implementations. One can support 1:1, another 1:n.

@aojea (Member) Mar 15, 2024

We cannot regress in scalability and performance. @l1b0k's question and comment is spot on; we need to understand the impact of this overhead. We cannot sacrifice performance for flexibility without a real understanding of how much penalty this flexibility will cause, and what its benefit is ... opening a door for multiple implementations to behave differently is not a good answer if it means users will get different behaviors in different places for the same cluster and APIs.
