
filesystem is not resized when restoring from snapshot/cloning to larger size than origin #972

Merged

merged 3 commits into kubernetes-sigs:master from snapshot-expansion on Oct 25, 2022

Conversation

RomanBednar
Contributor

@RomanBednar RomanBednar commented May 3, 2022

What type of PR is this?

/kind bug

What this PR does / why we need it:

The volume filesystem is not expanded during clone if the new PVC is larger than its data source volume.
Similarly, when restoring a snapshot to a PVC with a larger size, the filesystem is not expanded either.

This patch uses mount-utils to handle the volume size check and resizing.
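
For reference, here is a minimal sketch of how a mount-utils based resize check can be wired into node staging, assuming k8s.io/mount-utils and k8s.io/utils/exec; resizeIfNeeded and its arguments are placeholders for illustration, not the driver's exact code:

package main

import (
	"fmt"

	mount "k8s.io/mount-utils"
	utilexec "k8s.io/utils/exec"
)

// resizeIfNeeded grows the filesystem on devicePath when it is smaller than
// the underlying block device, e.g. after cloning or restoring a snapshot to
// a larger PVC. devicePath and stagingPath stand in for the values that
// NodeStageVolume already has at this point.
func resizeIfNeeded(devicePath, stagingPath string) error {
	resizer := mount.NewResizeFs(utilexec.New())

	// NeedResize compares the block device size with the filesystem size.
	need, err := resizer.NeedResize(devicePath, stagingPath)
	if err != nil {
		return fmt.Errorf("checking if %q needs resize: %w", devicePath, err)
	}
	if need {
		// Resize shells out to resize2fs or xfs_growfs depending on fstype.
		if _, err := resizer.Resize(devicePath, stagingPath); err != nil {
			return fmt.Errorf("resizing %q: %w", devicePath, err)
		}
	}
	return nil
}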

Which issue(s) this PR fixes:

Fixes #784

Special notes for your reviewer:
Here is how I verified the patch manually.

BEFORE PATCH:

1. Create pvc with 1Gi/pod

2. Create clone pvc with 5Gi/pod

$ oc get pvc/clone-of-pvc-1 -o yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: pd.csi.storage.gke.io
    volume.kubernetes.io/selected-node: ci-ln-469g9qt-72292-bp8k2-worker-a-jmrcv
    volume.kubernetes.io/storage-provisioner: pd.csi.storage.gke.io
  creationTimestamp: "2022-04-11T06:30:29Z"
  finalizers:
  - kubernetes.io/pvc-protection
  name: clone-of-pvc-1
  namespace: test
  resourceVersion: "40853"
  uid: 4d439336-35b1-4836-8ece-2c21ce36d1b2
spec:
  accessModes:
  - ReadWriteOnce
  dataSource:
    apiGroup: null
    kind: PersistentVolumeClaim
    name: pvc2
  resources:
    requests:
      storage: 5Gi
  storageClassName: standard-csi
  volumeMode: Filesystem
  volumeName: pvc-4d439336-35b1-4836-8ece-2c21ce36d1b2
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 5Gi
  phase: Bound
  
3. Check PVCs

$ oc get pvc
NAME             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
clone-of-pvc-1   Bound    pvc-4d439336-35b1-4836-8ece-2c21ce36d1b2   5Gi        RWO            standard-csi   59m
pvc2             Bound    pvc-9508e898-d958-412b-b2c8-4523f87d7b25   1Gi        RWO            standard-csi   69m

4. The filesystem was not resized in the pod (5Gi expected)

$ oc exec mypodclone -t -i -- df -h /tmp
Filesystem                Size      Used Available Use% Mounted on
/dev/sdc                975.9M      2.5M    957.4M   0% /tmp

AFTER PATCH:

1. Create pvc with 1Gi/pod

2. Create snapshot of the pvc

3. Create clone of the pvc with 5Gi/pod

4. Restore the snapshot to a new pvc with 5Gi/pod

5. Create a pod that mounts both clone and restored snapshot pvc

6. Check that the filesystem was resized correctly in the pod

$ oc get pvc
NAME                     STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
pvc-1                    Bound    pvc-71b5bb22-2ca7-4150-af44-b5b0b48111a7   1Gi        RWO            standard-csi   15m
pvc-1-clone              Bound    pvc-e6ebb004-baba-446b-b5f5-90bac00240c9   5Gi        RWO            standard-csi   14m
pvc-1-snapshot-restore   Bound    pvc-6d325c59-c6e1-428b-b83d-958c78c71964   5Gi        RWO            standard-csi   13m

$ oc exec task-pv-pod-2 -t -i -- df -h | grep -E "clone|restore"
/dev/sdd        4.9G  4.0M  4.9G   1% /tmp/clone
/dev/sdc        4.9G  4.0M  4.9G   1% /tmp/restore

Does this PR introduce a user-facing change?:

The filesystem is now expanded correctly when restoring from a snapshot or cloning a volume to a larger size than the origin volume.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 3, 2022
@k8s-ci-robot
Contributor

Welcome @RomanBednar!

It looks like this is your first PR to kubernetes-sigs/gcp-compute-persistent-disk-csi-driver 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/gcp-compute-persistent-disk-csi-driver has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 3, 2022
@k8s-ci-robot
Contributor

Hi @RomanBednar. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 3, 2022
@RomanBednar
Contributor Author

@mattcary Hi, I've noticed there are tests for NodeExpandVolume that are commented out with a note about fakeexec being too brittle:

// TODO: This test is too brittle due to the fakeexec package not being

But by adding the mount-utils resizer to NodeStageVolume I broke the existing unit tests for this function, because there are actual command executions in the resize flow, so I used fakeexec just to have a working example. A better approach might be using a mock interface instead. What do you think?
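
To make the brittleness concrete, here is a hedged sketch of the fakeexec approach using k8s.io/utils/exec/testing; the scripted commands and outputs are illustrative only, not the driver's actual test fixtures:

package node_test

import (
	"k8s.io/utils/exec"
	testingexec "k8s.io/utils/exec/testing"
)

// fakeResizeExec scripts one canned result per command the resize flow is
// expected to run, in order. Any new, removed, or reordered command in the
// production code breaks the script, which is what makes such tests brittle.
func fakeResizeExec() *testingexec.FakeExec {
	fstypeProbe := &testingexec.FakeCmd{CombinedOutputScript: []testingexec.FakeAction{
		// Illustrative filesystem type probe result.
		func() ([]byte, []byte, error) { return []byte("ext4"), nil, nil },
	}}
	deviceSize := &testingexec.FakeCmd{CombinedOutputScript: []testingexec.FakeAction{
		// Illustrative device size in bytes (5Gi).
		func() ([]byte, []byte, error) { return []byte("5368709120"), nil, nil },
	}}
	return &testingexec.FakeExec{
		ExactOrder: true,
		CommandScript: []testingexec.FakeCommandAction{
			func(cmd string, args ...string) exec.Cmd {
				return testingexec.InitFakeCmd(fstypeProbe, cmd, args...)
			},
			func(cmd string, args ...string) exec.Cmd {
				return testingexec.InitFakeCmd(deviceSize, cmd, args...)
			},
		},
	}
}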

@mattcary
Contributor

mattcary commented May 3, 2022

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 3, 2022
@mattcary
Contributor

mattcary commented May 3, 2022

I dunno, creating a mock that would actually catch errors or regressions is very hard. Maybe just stick with fakeexec for now and we'll see how it does.

This should be exercised in e2e tests as it's part of the csi spec (but probably isn't?)

@RomanBednar
Contributor Author

I dunno, creating a mock that would actually catch errors or regressions is very hard. Maybe just stick with fakeexec for now and we'll see how it does.

This should be exercised in e2e tests as it's part of the csi spec (but probably isn't?)

We've investigated the tests with @jsafrane and we currently see two issues:

  1. GCP driver sanity tests use fakes (mounter + cloud provider).
    This causes the resize code to fail on command executions because it can't read the real size of the device.
    However, fakeexec with a predefined command sequence (like I used in the unit tests) is probably not an option
    for sanity tests because the fake mounter is shared across many other tests.

    https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/blob/master/test/sanity/sanity_test.go#L62

  2. CSI e2e tests currently don't exercise the resize code for NodeStageVolume.
    An update of the tests is planned in the scope of the snapshot restore/resize effort. It is mentioned as the last
    item on the checklist in: Filesystem resize is required after snapshot restore / volume clone kubernetes/kubernetes#94929

I'll look into it and update the CSI tests accordingly once we resolve the sanity test issue. Another catch is that adding CSI e2e tests will take quite some time before we can use them (until the next Kubernetes release at best). @mattcary What would you suggest to fix the sanity tests? And do we want to wait with this patch until the e2e tests are available?

@mattcary
Contributor

Thanks for laying out the state of things so clearly!

Would it be easy to add a gcp-persistent-disk-csi-driver/test/e2e test? (confusingly this is not the k/k e2e test)

It should be about as straightforward as adding a sanity check, except it uses a real cloud provider etc., so no faking is necessary.

@jsafrane
Contributor

jsafrane commented May 11, 2022

@mattcary, while it would be possible to add a new test to test/e2e, it's not easy to fix e2e/sanity - all NodeStage calls there will fail, because the fake mounter / exec does not return a valid device size when asked. To me it looks like fixing that would take a huge amount of work with a doubtful result.

Out of curiosity, is there any reason why you don't run the pull-gcp-compute-persistent-disk-csi-driver-sanity tests with a real cloud + mounter + exec?
In addition, running the k/k tests would be beneficial too; you are able to install the driver and run the tests in k/k CI, so why not here?

@mattcary
Contributor

@mattcary, while it would be possible to add a new test to test/e2e, it's not easy to fix e2e/sanity - all NodeStage calls there will fail, because the fake mounter / exec does not return a valid device size when asked. To me it looks like fixing that would take a huge amount of work with a doubtful result.

I agree. Sorry for not making that more clear: I don't think we should try to fix sanity, I think we should just put it in an e2e test. That would mean skipping the resize tests in sanity (the other tests will continue to work, or have those been broken too?)

I'm not actually familiar with the sanity tests; I've never had a reason to look at them. If we do run them in a cloud context, would they substantially duplicate the e2e tests? I'll have to look into them. I suppose our e2e tests hit GCE-specific features that aren't covered by sanity, and sanity does more thorough generic CSI testing. If my supposition is true, I think it would probably be a good idea to run them with real resources. That's been reliable in the e2e tests, and I've never found fakes a very convincing test strategy.

Out of curiosity, is there any reason why you don't run the pull-gcp-compute-persistent-disk-csi-driver-sanity tests with a real cloud + mounter + exec? In addition, running the k/k tests would be beneficial too; you are able to install the driver and run the tests in k/k CI, so why not here?

test/k8s-integration runs the k/k tests. There's real value in splitting those out from e2e tests as they take a lot longer to run and are way way flakier. Just bringing up a cluster consistently across k8s minor versions, different versions of kubetest, etc, is a lot of toil.

@jsafrane
Contributor

jsafrane commented May 12, 2022

That would be skipping resize tests in sanity (the other tests will continue to work, or have those been broken too?)

Unfortunately, all sanity tests that call NodeStage will fail, because NodeStage is not able to read device size with the fake exec.

If we do run them in a cloud context, would they substantially duplicate the e2e tests?

There definitely is some overlap between your custom e2e tests, kubernetes-integration, and the sanity tests. The sanity tests add some checks that cannot be easily tested in Kubernetes, such as that the driver returns the right error code when a volume does not exist, a staging path does not exist, and so on. If you want to use the sanity tests, then you should run them against a real CSI driver in a real cloud. On the other hand, you don't need Kubernetes to run the tests (unless the driver itself needs to talk to Kubernetes). You need a VM in GCE, run the CSI driver in the VM (either in a container or directly on the host, it does not really matter) and poke the driver's CSI socket with the sanity tests. They will provision real volumes, attach them to the VM and mount them.
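
For concreteness, a hedged sketch of pointing the sanity suite at a live driver socket, assuming the csi-test v4 sanity package; the test name and socket path are placeholders for wherever the already-running driver listens on the VM:

package sanity_test

import (
	"testing"

	"github.com/kubernetes-csi/csi-test/v4/pkg/sanity"
)

// TestSanityAgainstRealDriver runs the generic CSI sanity checks against a
// driver that is already running on this VM with real cloud credentials, so
// no mounter or exec fakes are involved.
func TestSanityAgainstRealDriver(t *testing.T) {
	config := sanity.NewTestConfig()
	config.Address = "unix:///tmp/csi.sock" // placeholder socket path
	sanity.Test(t, config)
}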

@mattcary
Contributor

See #991 and kubernetes/test-infra#26354; we're making the sanity tests optional while we sort this out.

I think that unblocks this PR? (as soon as the test-infra change goes in)

@jsafrane
Contributor

/refresh
/retest

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 13, 2022
@mattcary
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 13, 2022
@RomanBednar RomanBednar force-pushed the snapshot-expansion branch 2 times, most recently from b925ef8 to 0cc341d on September 30, 2022 15:37
NodeStageVolume(), which is called a few lines above in
NodeStageExt4Volume(), can now resize the filesystem if needed, so this
check is no longer valid.
@mattcary
Contributor

Thanks for the review @mauriciopoppe !

/lgtm
/approve

(and of course thanks for the PR @RomanBednar :-)

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 25, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mattcary, RomanBednar

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 25, 2022
@k8s-ci-robot k8s-ci-robot merged commit b4f6d95 into kubernetes-sigs:master Oct 25, 2022
RomanBednar added a commit to RomanBednar/release that referenced this pull request Feb 27, 2023
This test was also disabled upstream:
https://github.com/kubernetes/test-infra/pull/26354/files

The problem is that sanity tests currently use command faker and only
one driver instance. That makes any command faking impossible as
the same instance of the driver is re-used in many tests. Relevant
discussion upstream is here:
kubernetes-sigs/gcp-compute-persistent-disk-csi-driver#972 (comment)
openshift-merge-robot pushed a commit to openshift/release that referenced this pull request Mar 17, 2023
This test was also disabled upstream:
https://github.com/kubernetes/test-infra/pull/26354/files

The problem is that sanity tests currently use command faker and only
one driver instance. That makes any command faking impossible as
the same instance of the driver is re-used in many tests. Relevant
discussion upstream is here:
kubernetes-sigs/gcp-compute-persistent-disk-csi-driver#972 (comment)
bmanzari pushed a commit to bmanzari/release that referenced this pull request Mar 30, 2023
This test was also disabled upstream:
https://github.com/kubernetes/test-infra/pull/26354/files

The problem is that sanity tests currently use command faker and only
one driver instance. That makes any command faking impossible as
the same instance of the driver is re-used in many tests. Relevant
discussion upstream is here:
kubernetes-sigs/gcp-compute-persistent-disk-csi-driver#972 (comment)
amacaskill added a commit to amacaskill/gcp-compute-persistent-disk-csi-driver that referenced this pull request Apr 14, 2023
… is not backported to 1.8-. The fix is only in 1.9+.