gh-51067: Add remove() and repack() to ZipFile #134627

Open: wants to merge 69 commits into base: main
Commits
6aed859
Add `remove()` and `repack()` to `ZipFile`
danny0838 May 24, 2025
5453dbc
📜🤖 Added by blurb_it.
blurb-it[bot] May 24, 2025
80ab2e2
Fix and optimize test code
danny0838 May 24, 2025
72c2a66
Handle common setups with `setUpClass`
danny0838 May 24, 2025
a4b410b
Add tests for mode `w` and `x` for `remove()`
danny0838 May 24, 2025
a9e85c6
Introduce `_calc_initial_entry_offset` and refactor
danny0838 May 24, 2025
236cd06
Optimize `_calc_initial_entry_offset` by introducing cache
danny0838 May 24, 2025
bdc58c7
Introduce `_validate_local_file_entry` and refactor
danny0838 May 24, 2025
c3c8345
Introduce `_debug` and refactor
danny0838 May 24, 2025
1b7d75a
Introduce `_move_entry_data` and rework chunk_size passing
danny0838 May 25, 2025
51c9254
Refactor `_validate_local_file_entry`
danny0838 May 25, 2025
0d971d8
Add `strict_descriptor` option
danny0838 May 25, 2025
8f0a504
Fix and improve validation tests
danny0838 May 25, 2025
0cb8682
Remove obsolete NameToInfo updating
danny0838 May 25, 2025
a788a00
Use `zinfo` rather than `info`
danny0838 May 25, 2025
ae01b8c
Raise on overlapping file blocks
danny0838 May 25, 2025
edee203
Rework writing protection
danny0838 May 25, 2025
555ac78
Update doc
danny0838 May 25, 2025
95fde31
Fix typo
danny0838 May 26, 2025
8a448e4
Add test for bytes between file entries
danny0838 May 26, 2025
4c35eb2
Check `testzip()` after zip file closed
danny0838 May 26, 2025
926338c
Support `repack(removed)`
danny0838 May 26, 2025
e76f9a1
Fix bytes between entries be removed when `removed` is passed
danny0838 May 26, 2025
93f4c25
Fix bad test code
danny0838 May 26, 2025
9e94209
Revise docstring
danny0838 May 27, 2025
3ef72c6
Add `tearDown` for tests
danny0838 May 28, 2025
fbf7588
Rename methods and parameters
danny0838 May 28, 2025
81a419a
Adjust parameter order
danny0838 May 28, 2025
c62a455
Optimize code and revise comment
danny0838 May 28, 2025
a05353c
Improve debug for `_ZipRepacker.repack()`
danny0838 May 29, 2025
3d0240c
Rework `_validate_local_file_entry_sequence` to return size or None
danny0838 May 29, 2025
31c4c93
Rework `_validate_local_file_entry_sequence` to allow passing no `che…
danny0838 May 29, 2025
f8fade1
Introduce `_scan_data_descriptor_no_sig_by_decompression`
danny0838 May 30, 2025
c80d21b
Strip only entries immediately following a referenced entry
danny0838 May 29, 2025
e1caea9
Adjust method names
danny0838 May 30, 2025
2b23d46
Add memory usage test
danny0838 May 30, 2025
de4f15b
Fix rst
danny0838 May 30, 2025
ea3259f
Optimize code
danny0838 Jun 1, 2025
fef92c4
Fix and optimize `_iter_scan_signature`
danny0838 Jun 1, 2025
8067b0c
Fix `_scan_data_descriptor`
danny0838 Jun 1, 2025
92d3a9c
Fix and optimize `_scan_data_descriptor_no_sig`
danny0838 Jun 1, 2025
b5d7ae3
Rename `_trace_compressed_block_end`
danny0838 Jun 1, 2025
1d5ec61
Fix `_scan_data_descriptor_no_sig_by_decompression`
danny0838 Jun 1, 2025
db9d0d6
Add tests for `_ZipRepacker`
danny0838 Jun 1, 2025
aaa566c
Remove unneeded import
danny0838 Jun 1, 2025
578c7c8
Add requirements
danny0838 Jun 1, 2025
c470c33
Fix `_scan_data_descriptor_no_sig_by_decompression` when library not …
danny0838 Jun 1, 2025
b1dcb07
Test with pre-calculated CRC
danny0838 Jun 1, 2025
04cddef
Remove unneeded import
danny0838 Jun 1, 2025
797a62c
Fix and optimize `repack`
danny0838 Jun 1, 2025
3b2f232
Remove unneeded catch type
danny0838 Jun 14, 2025
cb549c9
Patch more explicitly
danny0838 Jun 14, 2025
0f50a6f
Remove unneeded variables
danny0838 Jun 14, 2025
c759b63
Improve dependency check for decompression tests
danny0838 Jun 14, 2025
1ece5b1
Refactor and optimize `RepackHelperMixin`
danny0838 Jun 14, 2025
ce88616
Update NEWS
danny0838 Jun 20, 2025
5f093e5
Sync with danny0838/zipremove@1691ca25bf971cf1e45d5ed7d22c512636f20cb8
danny0838 Jun 20, 2025
11c0937
Revise NEWS
danny0838 Jun 20, 2025
4b2176e
Sync with danny0838/zipremove@1843d87b70e6cb129fb55446eaf4486a87d2af4d
danny0838 Jun 21, 2025
d9824ce
Fix timezone related timestamp issue
danny0838 Jun 21, 2025
85811ab
Simplify tests with data descriptors
danny0838 Jun 22, 2025
748ac63
Sync with danny0838/zipremove@e79042768f3c2541e0226f6bed3a9ff2ee04fac0
danny0838 Jun 23, 2025
001a8d0
Sync with danny0838/zipremove@87bcdb50411a355d24c35f31dcbe4273c0568cf8
danny0838 Jun 24, 2025
3a364ce
Sync with danny0838/zipremove@6a78bd15de87afde510f8a1b6364365c6e17f252
danny0838 Jun 25, 2025
0832528
Sync with danny0838/zipremove@092f98b4d7b3a0cd335fe4ba64e7090ebb3dc6da
danny0838 Jun 27, 2025
f20ec5d
Revise doc for `repack`
danny0838 Jun 28, 2025
8e69c09
Revise doc for `remove`
danny0838 Jun 28, 2025
725b1a3
Update `data_offset`
danny0838 Jun 29, 2025
9e82bb7
Revise doc for `repack`
danny0838 Jul 1, 2025
63 changes: 63 additions & 0 deletions Doc/library/zipfile.rst
@@ -518,6 +518,69 @@ ZipFile Objects
   .. versionadded:: 3.11


.. method:: ZipFile.remove(zinfo_or_arcname)

   Removes a member entry from the archive's central directory.
   *zinfo_or_arcname* may be the full path of the member or a :class:`ZipInfo`
   instance. If multiple members share the same full path and the path is
   provided, only one of them is removed.

   The archive must be opened with mode ``'w'``, ``'x'`` or ``'a'``.

   Returns the removed :class:`ZipInfo` instance.

   Calling :meth:`remove` on a closed ZipFile will raise a :exc:`ValueError`.

   .. note::
      This method only removes the member's entry from the central directory,
      making it inaccessible to most tools. The member's local file entry,
      including content and metadata, remains in the archive and is still
      recoverable using forensic tools. Call :meth:`repack` afterwards to
      completely remove the member and reclaim space.

   .. versionadded:: next
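
The note above can be illustrated on current Python releases, where `remove()` does not exist yet. The following is only a sketch under stated assumptions: `drop_from_central_directory` is a hypothetical helper, not the PR's implementation, and it relies on the private `_didModify` attribute of CPython's `ZipFile`, which may change without notice. It shows that dropping an entry from the central directory hides the member from `namelist()` while the stored payload stays in the archive bytes.

```python
import io
import zipfile


def drop_from_central_directory(zf, name):
    """Hypothetical stand-in for the proposed ZipFile.remove().

    Only the in-memory central directory listing is changed; the local
    file entry (header and data) is left where it is.
    """
    zinfo = zf.getinfo(name)      # raises KeyError if the member is absent
    zf.filelist.remove(zinfo)
    del zf.NameToInfo[name]
    # Assumption: private CPython attribute that makes close() rewrite
    # the central directory in modes 'w', 'x' and 'a'.
    zf._didModify = True
    return zinfo


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("keep.txt", b"keep me")
    zf.writestr("secret.txt", b"top-secret payload")

with zipfile.ZipFile(buf, "a") as zf:
    removed = drop_from_central_directory(zf, "secret.txt")

with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()
    kept = zf.read("keep.txt")

# The member is no longer listed, but its uncompressed payload is still
# present in the raw archive bytes until the archive is repacked.
payload_still_present = b"top-secret payload" in buf.getvalue()
```

Because the entry is stored uncompressed here, a plain byte search is enough to find the leftover payload; with a compressed member the data would still be there, just not searchable as plain text.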


.. method:: ZipFile.repack(removed=None, *, \
                           strict_descriptor=False[, chunk_size])

   Rewrites the archive to remove unreferenced local file entries, shrinking
   its file size. The archive must be opened with mode ``'a'``.

   If *removed* is provided, it must be a sequence of :class:`ZipInfo` objects
   representing the recently removed members, and only their corresponding
   local file entries will be removed. Otherwise, the archive is scanned to
   locate and remove local file entries that are no longer referenced in the
   central directory.

   When scanning, setting ``strict_descriptor=True`` disables detection of any
   entry using an unsigned data descriptor (a format deprecated by the ZIP
   specification since version 6.3.0, released on 2006-09-29, and used only by
   some legacy tools), which is significantly slower to scan (around 100 to
   1000 times). This does not affect performance on entries without such a
   feature.

Review comment (Member), on the ``strict_descriptor`` paragraph above:

Why not default to strict_descriptor=True given that it performs better and the zip files we expect people to be manipulating in remove/repack manners are presumed most likely to be "modern" forms?

Reply (Author):

This is exactly one of the open questions (#134627 (comment), #134627 (comment)).

The current quick decision is primarily because it adheres better to the spec, and most of the Python stdlib tends to prioritize compatibility over performance, e.g. json.dump with ensure_ascii=True and http.server with HTTP version 1.0. But it's not solid and can be changed, based on a vote or something?

   *chunk_size* may be specified to control the buffer size when moving
   entry data (default is 1 MiB).
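
To make the role of a chunk size concrete, here is a minimal sketch of moving a byte range toward the start of a file through a bounded buffer. `move_bytes` is a hypothetical helper written for illustration and is not the PR's `_move_entry_data`; only the 1 MiB default is taken from the text above.

```python
import io


def move_bytes(fp, old_offset, new_offset, size, chunk_size=2**20):
    """Move `size` bytes from old_offset to new_offset within fp.

    Only backward moves (new_offset <= old_offset) are supported, so each
    chunk is read before any write can overwrite it, even when the source
    and destination ranges overlap.
    """
    if new_offset > old_offset:
        raise ValueError("this sketch only moves data toward the start")
    moved = 0
    while moved < size:
        fp.seek(old_offset + moved)
        chunk = fp.read(min(chunk_size, size - moved))
        if not chunk:
            raise EOFError("unexpected end of file while moving data")
        fp.seek(new_offset + moved)
        fp.write(chunk)
        moved += len(chunk)


# Close a 4-byte gap by sliding the trailing 8 bytes forward, 3 bytes at a time.
buf = io.BytesIO(b"HEADgap!01234567")
move_bytes(buf, old_offset=8, new_offset=4, size=8, chunk_size=3)
buf.truncate(12)
result = buf.getvalue()
```

A larger chunk size means fewer seek/read/write round trips at the cost of a larger temporary buffer; memory use stays bounded by the chunk size regardless of how large the moved entry is.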

   Calling :meth:`repack` on a closed ZipFile will raise a :exc:`ValueError`.

   .. note::
      The scanning algorithm is heuristic-based and assumes that the ZIP file
      is normally structured, for example, with local file entries stored
      consecutively, without overlap or interleaved binary data. Prepended
      binary data, such as a self-extractor stub, is recognized and preserved
      unless it happens to contain bytes that coincidentally resemble a valid
      local file entry in multiple respects (an extremely rare case). Embedded
      ZIP payloads are also handled correctly, as long as they follow normal
      structure. However, the algorithm does not guarantee correctness or
      safety on untrusted or intentionally crafted input. It is generally
      recommended to provide the *removed* argument for better reliability and
      performance.

   .. versionadded:: next
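
The idea of scanning for unreferenced local file entries can be sketched with the standard library alone. This is a deliberately simplified illustration, not the PR's algorithm: `find_unreferenced_entries` is a hypothetical helper that assumes entries start at offset 0, are stored consecutively, and use neither data descriptors nor ZIP64 sizes. The orphaned entry is produced by editing `filelist` and the private `_didModify` attribute, which is an assumption about CPython internals.

```python
import io
import struct
import zipfile

# Fixed 30-byte part of a local file header: signature, version, flags,
# method, time, date, CRC-32, compressed size, uncompressed size,
# file name length, extra field length.
LOCAL_HEADER = struct.Struct("<4s5H3L2H")
LOCAL_SIG = b"PK\x03\x04"


def find_unreferenced_entries(data):
    """Return [(offset, name)] for local entries missing from the central directory."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        referenced = {zinfo.header_offset for zinfo in zf.infolist()}
    orphans = []
    pos = 0
    # Walk consecutive local entries until something else (normally the
    # central directory) is reached.
    while data[pos:pos + 4] == LOCAL_SIG:
        (_, _, flags, _, _, _, _, comp_size, _,
         name_len, extra_len) = LOCAL_HEADER.unpack_from(data, pos)
        if flags & 0x08:
            # Sizes live in a trailing data descriptor; out of scope here.
            raise NotImplementedError("data descriptors are not handled")
        name_start = pos + LOCAL_HEADER.size
        encoding = "utf-8" if flags & 0x800 else "cp437"
        name = data[name_start:name_start + name_len].decode(encoding)
        if pos not in referenced:
            orphans.append((pos, name))
        pos = name_start + name_len + extra_len + comp_size
    return orphans


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", b"alpha")
    zf.writestr("b.txt", b"bravo")
    zf.writestr("c.txt", b"charlie")

# Drop b.txt from the central directory only (assumption: private attribute).
with zipfile.ZipFile(buf, "a") as zf:
    zinfo = zf.getinfo("b.txt")
    zf.filelist.remove(zinfo)
    del zf.NameToInfo["b.txt"]
    zf._didModify = True

orphans = find_unreferenced_entries(buf.getvalue())
orphan_names = [name for _, name in orphans]
```

The sketch trusts every size field it reads, which is exactly why the note warns about untrusted or crafted input. Passing the removed `ZipInfo` objects explicitly avoids this kind of scan altogether.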


The following data attributes are also available:

.. attribute:: ZipFile.filename