
gh-51067: Add remove() and repack() to ZipFile #134627


Open · wants to merge 68 commits into main

Conversation

@danny0838 commented May 24, 2025

This is a revised version of PR #103033, implementing two new methods in zipfile.ZipFile: remove() and repack(), as suggested in this comment.

Features

ZipFile.remove(zinfo_or_arcname)

  • Removes a file entry (by providing a str path or ZipInfo) from the central directory.
  • If there are multiple file entries with the same path, only one is removed when a str path is provided.
  • Returns the removed ZipInfo instance.
  • Supported in modes: 'a', 'w', 'x'.

ZipFile.repack(removed=None)

  • Physically removes stale local file entry data that is no longer referenced by the central directory.
  • Shrinks the archive file size.
  • If removed is passed (as a sequence of removed ZipInfos), only their corresponding local file entry data are removed.
  • Only supported in mode 'a'.
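
A minimal usage sketch of the two methods as described above (the archive and member names are illustrative):

import zipfile

with zipfile.ZipFile('archive.zip', 'a') as zh:
    zinfo = zh.remove('a.txt')  # drop the central directory entry; returns the ZipInfo
    zh.repack([zinfo])          # physically reclaim the stale local file data

# Alternatively, repack() with no argument scans the archive for stale entries:
# zh.repack()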

Rationale

Heuristics Used in repack()

Since repack() does not clean up removed entries at the time remove() is called, the header information of removed file entries may no longer be available by the time it runs, so it can be technically difficult to determine whether certain stale bytes really are previously removed files and are safe to remove.

While local file entries begin with the magic signature PK\x03\x04, this alone is not a reliable indicator. For instance, a self-extracting ZIP file may contain executable code before the actual archive, which could coincidentally include such a signature, especially if it embeds ZIP-based content.

To safely reclaim space, repack() assumes that in a normal ZIP file, local file entries are stored consecutively:

  • File entries must not overlap.
    • If any entry’s data overlaps with the next, a BadZipFile error is raised and no changes are made.
  • There should be no extra bytes between entries (or between the last entry and the central directory):
    1. Data before the first referenced entry is removed only when it appears to be a sequence of consecutive entries with no extra following bytes; extra preceding bytes are preserved.
    2. Data between referenced entries is removed only when it appears to be a sequence of consecutive entries with no extra preceding bytes; extra following bytes are preserved.
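
For illustration, a sketch of how the scan mode interacts with prepended bytes, assuming the heuristics behave as described above (file and member names are illustrative):

import zipfile

with open('archive.zip', 'wb') as fh:
    fh.write(b'STUB')  # e.g. a self-extractor stub before the archive
    with zipfile.ZipFile(fh, 'a') as zh:
        zh.writestr('a.txt', 'aaa')
        zh.writestr('b.txt', 'bbb')

with zipfile.ZipFile('archive.zip', 'a') as zh:
    zh.remove('a.txt')
    zh.repack()  # a.txt's stale local entry (ending exactly at b.txt's
                 # header) is stripped; the 4 leading stub bytes remain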

See the docstring of _ZipRepacker.repack() (which is internally called by ZipFile.repack()) in the source code for more details.

Supported Modes

There have been opinions that repacking should also support modes 'w' and 'x' (e.g. #51067 (comment)).

This is NOT introduced, since those modes do not truncate the file at the end of writing and thus won't really shrink the file size after a removal has been made. Although we could change that behavior for the existing API, further care is needed because modes 'w' and 'x' may be used on an unseekable file and would be broken by such a change. OTOH, mode 'a' is not expected to work with an unseekable file, since an initial seek is made immediately when it is opened.



📚 Documentation preview 📚: https://cpython-previews--134627.org.readthedocs.build/

bedevere-app bot commented May 24, 2025

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

sharktide's comment was marked as off-topic.

@danny0838 danny0838 requested a review from sharktide May 24, 2025 17:29
@sharktide (Contributor) left a comment:

It probably would be better to raise an AttributeError instead of a ValueError here, since you are trying to access an attribute a closed zipfile doesn't have.

@danny0838 (Author):

It probably would be better to raise an AttributeError instead of a ValueError here, since you are trying to access an attribute a closed zipfile doesn't have.

This behavior simply resembles open() and write(), which raise a ValueError in various cases. Furthermore, there was a change from raising RuntimeError in Python 3.6:

Changed in version 3.6: Calling open() on a closed ZipFile will raise a ValueError. Previously, a RuntimeError was raised.

Changed in version 3.6: Calling write() on a ZipFile created with mode 'r' or a closed ZipFile will raise a ValueError. Previously, a RuntimeError was raised.
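
For reference, a minimal sketch of the existing behavior being mirrored, using only stdlib calls:

import io
import zipfile

zh = zipfile.ZipFile(io.BytesIO(), 'w')
zh.close()
try:
    zh.open('member')  # any further use of the closed archive
except ValueError as exc:
    print(exc)  # Attempt to use ZIP archive that was already closed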

@danny0838 danny0838 requested a review from sharktide May 24, 2025 17:58
@danny0838 (Author) commented May 24, 2025

Kindly informing @ubershmekel, @barneygale, @merwok, and @wimglenn about this PR. This should be more desirable and flexible than the previous PR, although care must be taken, as there might be a potential risk in the space-reclaiming algorithm.

The previous PR is kept open in case some folks are interested in it. Will close when either one is accepted.

danny0838 added 6 commits May 25, 2025 16:18
- Separate individual validation tests.
- Check underlying repacker not called in validation.
- Use `unlink` to prevent FileNotFoundError.
- Fix mode 'x' test.
- Set `_writing` to prevent `open('w').write()` during repacking.
- Move the protection logic to `ZipFile.repack()`.
@jaraco (Member) commented Jun 15, 2025

Should the strict_descriptor option for repack() default to True or False?

I don't love that there are two implementations here. My reading of the code is that unsigned descriptors are deprecated, so maybe it would be enough to simply omit support for them, opt for the faster implementation, and (maybe) warn that such a descriptor is unsupported if one is encountered. How prevalent are such descriptors?

@danny0838 (Author) commented Jun 15, 2025

Should the strict_descriptor option for repack() default to True or False?

I don't love that there are two implementations here. My reading of the code is that unsigned descriptors are deprecated, so maybe it would be enough to simply omit support for them, opt for the faster implementation, and (maybe) warn that such a descriptor is unsupported if one is encountered. How prevalent are such descriptors?

Actually there are three implementations. 😂

In brief, the slow scan is used only when all of the following conditions hold:

  1. removed is not provided.
  2. A local-file-header-like byte sequence is present (with the PK\x03\x04 signature, flag bit 8 set, and CRC == compress_size == file_size == 0), which should be very unlikely to occur in random bytes (see the sketch after this list).
  3. A valid signed data descriptor is not seen.
  4. The compression method used is not one that can reliably detect the stream ending (i.e. the entry uses STORED or LZMA; not sure whether the LZMA decompressor can be reworked to support this).
  5. strict_descriptor==False
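
A rough sketch of the header check in condition 2 (the helper name is hypothetical, not part of this PR; the struct layout follows the ZIP local file header):

import struct

def looks_like_streamed_entry(chunk):
    # Local-file-header-like byte sequence: PK\x03\x04 signature, the
    # data descriptor flag (0x08) set, and CRC, compress_size, and
    # file_size all zero.
    if len(chunk) < 30 or chunk[:4] != b'PK\x03\x04':
        return False
    (_sig, _ver_lo, _ver_hi, flags, _method, _time, _date,
     crc, csize, usize, _nlen, _elen) = struct.unpack('<4s2B4HL2L2H', chunk[:30])
    return bool(flags & 0x08) and crc == csize == usize == 0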

It should be quite unlikely to happen even when strict_descriptor==False, but the problem remains that it may be catastrophically slow once it happens, and could be exploited intentionally and offensively.

The prevalence of the deprecated unsigned data descriptor is hard to tell. Most apps like WinZip, WinRAR, and 7z would probably write no data descriptor, since they are generally not used in a streaming context. For streaming cases I think most developers would simply use the ZIP implementation at hand, so to answer this question we'd have to check the ZIP implementations of popular programming languages like C, Java, JS, PHP, C#, etc.

Since this feature has already been implemented, I'd prefer to keep it unless there's a strong reason it should not exist. Defaulting strict_descriptor to False should be enough if it's not welcome.

A minor reason is that we still cannot clearly say unsigned data descriptors are unsupported even if the slow scan is removed, due to the decompression approach. But maybe another option is that the decompression approach should also be skipped when strict_descriptor==True.

@danny0838 (Author):

Reworked strict_descriptor=True to ignore any scan of unsigned data descriptors. (danny0838/zipremove@1a07dff)

@distantvale:

@danny0838 in reply to #51067 (comment) (sorry about fragmenting the discussion) -

I'm generally supportive of the iterative approach of starting simple and expanding from there.

Having said that, since you already implemented _remove_members, I don't see why not to expose it through a public API, given that ZipFile has other examples of functions that receive a list of members (the members arg of extractall) - we can simply follow the same pattern for consistency.
I do agree it's much more useful - my own local implementation only removes a single member, and I do see the performance benefits of being able to remove multiple members, as I myself often need to remove multiple members in many different files, and it would be very beneficial.

I'm not fond of the idea of a 2-step removal (remove/repack), since it can leave the file in a state that some (me included) may see as corruption - zip files aren't supposed to have a CD entry without a corresponding LF entry.
I'd rather take an approach that limits certain things (for example, you cannot remove a file from a self-extracting archive),
and I don't see a large benefit in the flexibility of delayed repacking; it seems like rather an edge case (but I might be wrong here).

Also, some of the questions (how to handle a folder) are relevant regardless of whether we take the 1-step or 2-step approach.
If the 2-step approach has more support here, I'll take it - we can always improve it in the future, but I'd prefer to keep things simpler when possible.

And last but not least - perfect is the enemy of good - it's not the last python version, and we can always improve as long as the initial version is reasonable, which I believe it is.

@danny0838 (Author) commented Jun 19, 2025

@distantvale

Having said that, since you already implemented _remove_members, I don't see why not to expose it through a public API, given that ZipFile has other examples of functions that receive a list of members (the members arg of extractall) - we can simply follow the same pattern for consistency.

I don't think extractall is a good analogy since it's read-only. Instead you should look at open('w'), write, writestr, and mkdir, each of which is a simple single-entry handler.

There are still design considerations even for your proposed members, for example:

  1. Should there be remove plus removeall, or a remove that accepts either a single entry or a list?
  2. What to do if identical names are passed? Should they be treated as a set and deduplicated, or be removed multiple times?
  3. What to do if any passed value causes an error (such as a nonexistent name)?

The current implementation of _remove_members actually defers such decisions. This is fine for a private helper method but is probably premature for a public API.

I'm not fond of the idea of a 2-step removal (remove/repack), since it can leave the file in a state that some (me included) may see as corruption - zip files aren't supposed to have a CD entry without a corresponding LF entry.

I don't get this. remove leaves an LF entry without a corresponding CD entry, which is clearly allowed by the ZIP spec and won't draw complaints from any ZIP app.
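
A sketch of that intermediate state (member name illustrative):

import zipfile

with zipfile.ZipFile('archive.zip', 'a') as zh:
    zh.remove('a.txt')   # CD entry removed; the LF data stays on disk

with zipfile.ZipFile('archive.zip') as zh:  # still a valid archive
    print('a.txt' in zh.namelist())         # False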

I'd rather take an approach that limits certain things (for example, you cannot remove a file from a self-extracting archive), and I don't see a large benefit in the flexibility of delayed repacking; it seems like rather an edge case (but I might be wrong here).

A delayed repacking isn't necessarily done after the archive is closed; it can also happen simply after the method is called and returns.

For example, a repacking after mixed operations of writing, removing, copying, and renaming, just before the archive is closed, as an interactive ZIP file editing tool would do.

It's probably challenging enough to design a one-time remove or removeall to support all such cases.

Also, some of the questions (how to handle a folder) are relevant regardless of whether we take the 1-step or 2-step approach.

I don't think there's too much need to dig into such details for low level APIs—just let them work simply like other existing methods. It's not the case for a one-time high level remove—and maybe copy and rename—though.

@distantvale:

@danny0838 I can share my own opinions about the desired behavior but of course not everyone would agree:

I don't think extractall is a good analogy since it's read-only. Instead you should look at open('w'), write, writestr, and mkdir, each of which is a simple single-entry handler.

I don't see the need to separate into read/write, as I believe they should all be consistent.
If certain functions, such as extract, have 2 versions - for single/multiple files - then it won't be an exception in the API, and if we find it useful, we can have both.

There are still design considerations even for your proposed members, for example:

  1. Should there be remove plus removeall, or a remove that accepts either a single entry or a list?

I'd follow extract in this case.

  2. What to do if identical names are passed? Should they be treated as a set and deduplicated, or be removed multiple times?

I might be wrong here, but I think add allows adding multiple files with the same name, so in that case I'd remove multiple times. If I'm wrong about add, I'd treat it as a set.

  3. What to do if any passed value causes an error (such as a nonexistent name)?

Just like in extractall, the names must be a subset of namelist().

The current implementation of _remove_members actually defers such decisions. This is fine for a private helper method but is probably premature for a public API.

I don't get this. remove leaves an LF entry without a corresponding CD entry, which is clearly allowed by the ZIP spec and won't draw complaints from any ZIP app.

Ok, fair enough - it just seems less desirable/clean to me, but I might be in the minority here.

A delayed repacking isn't necessarily done after the archive is closed; it can also happen simply after the method is called and returns.

For example, a repacking after mixed operations of writing, removing, copying, and renaming, just before the archive is closed, as an interactive ZIP file editing tool would do.

Yes, of course, but it might also be called after the file is closed. I can see some performance benefits, but it seems like something a low-level API would provide, as opposed to a high-level one.
But again, I won't die on that hill; if most people find it useful, so be it - either way it's a good place to start from my end.

It's probably challenging enough to design a one-time remove or removeall to support all such cases.

Also, some of the questions (how to handle a folder) are relevant regardless of whether we take the 1-step or 2-step approach.

I don't think there's too much need to dig into such details for low level APIs—just let them work simply like other existing methods. It's not the case for a one-time high level remove—and maybe copy and rename—though.

I'm fine with a "naive" API as well; documentation is enough for these cases at this point, IMO.

@danny0838 (Author) commented Jun 19, 2025

@distantvale

@danny0838 I can share my own opinions about the desired behavior but of course not everyone would agree:

I don't think extractall is a good analogy since it's read-only. Instead you should look at open('w'), write, writestr, and mkdir, each of which is a simple single-entry handler.

I don't see the need to separate into read/write, as I believe they should all be consistent. If certain functions, such as extract, have 2 versions - for single/multiple files - then it won't be an exception in the API, and if we find it useful, we can have both.

The truth is that they are not consistent. When you say they should be consistent, do you mean there should be a multiple-file version of open('w'), write, writestr, and mkdir; that there should be no multiple-file version of extractall; or that each of them should be a single method that supports both single and list input?

There are still design considerations even for your proposed members, for example:

  1. Should there be remove plus removeall, or a remove that accepts either a single entry or a list?

I'd follow extract in this case.

  2. What to do if identical names are passed? Should they be treated as a set and deduplicated, or be removed multiple times?

I might be wrong here, but I think add allows adding multiple files with the same name, so in that case I'd remove multiple times. If I'm wrong about add, I'd treat it as a set.

Actually there is no ZipFile.add. Do you mean write and others?

  3. What to do if any passed value causes an error (such as a nonexistent name)?

Just like in extractall, the names must be a subset of namelist().

In extractall each input is handled one by one, and any error causes subsequent inputs not to be handled. However, the current _remove_members handles removing and repacking simultaneously. If any error happens in the middle, the repacking is left partially done and the archive will be in an inconsistent state, which is unlikely to be an acceptable consequence.

Likewise, for extractall, providing duplicated names just extracts the same entry to the same filesystem path again; for removal this would be a totally different story.

Anyway, the current _remove_members implementation DOESN'T do what you state above. If you favor that approach and really want such behavior, you'd have to work on that PR, doing more testing and gathering feedback (I'm probably not going to keep working on it unless it's the final decision of the issue). It would also be nice if you could provide the code of the implementation you've been using.

@distantvale:

@distantvale

@danny0838 I can share my own opinions about the desired behavior but of course not everyone would agree:

I don't think extractall is a good analogy since it's read-only. Instead you should look at open('w'), write, writestr, and mkdir, each of which is a simple single-entry handler.

I don't see the need to separate into read/write, as I believe they should all be consistent. If certain functions, such as extract, have 2 versions - for single/multiple files - then it won't be an exception in the API, and if we find it useful, we can have both.

The truth is that they are not consistent. When you say they should be consistent, do you mean there should be a multiple-file version of open('w'), write, writestr, and mkdir; that there should be no multiple-file version of extractall; or that each of them should be a single method that supports both single and list input?

What I'm saying is that there's already precedent in the package for a function that has 2 versions, so it's not out of the ordinary to add another one.

There are still design considerations even for your proposed members, for example:

  1. Should there be remove plus removeall, or a remove that accepts either a single entry or a list?

I'd follow extract in this case.

  2. What to do if identical names are passed? Should they be treated as a set and deduplicated, or be removed multiple times?

I might be wrong here, but I think add allows adding multiple files with the same name, so in that case I'd remove multiple times. If I'm wrong about add, I'd treat it as a set.

Actually there is no ZipFile.add. Do you mean write and others?

Yup, I mean write.

  3. What to do if any passed value causes an error (such as a nonexistent name)?

Just like in extractall, the names must be a subset of namelist().

In extractall each input is handled one by one, and any error causes subsequent inputs not to be handled. However, the current _remove_members handles removing and repacking simultaneously. If any error happens in the middle, the repacking is left partially done and the archive will be in an inconsistent state, which is unlikely to be an acceptable consequence.

The behavior of extractall might not be aligned with the documentation, which states that members must be a subset of namelist() - hard to say what the intention was, but I think it's a legitimate check.
I agree re. the inconsistent state, and maybe that's the most significant reason to keep the external API simple for now. Once the initial version is released, expanding it with more functionality is easier.

Likewise, for extractall, providing duplicated names just extracts the same entry to the same filesystem path again; for removal this would be a totally different story.

Yeah I agree, and since multiple files with the same names are not very common, I'd opt for treating members as a set.

Anyway, the current _remove_members implementation DOESN'T do what you state above. If you favor that approach and really want such behavior, you'd have to work on that PR, doing more testing and gathering feedback (I'm probably not going to keep working on it unless it's the final decision of the issue). It would also be nice if you could provide the code of the implementation you've been using.

My implementation is based on the original PR, so there's not much there with regard to these options.

In any case I'd proceed with your approach for the time being.

@emmatyping (Member) left a comment:


Some thoughts in an initial pass. This is a pretty big change so I haven't had time to go over the repack implementation in detail. Thank you for working on all of these changes. I can see an impressive attention to detail!

Comment on lines 526 to 527
If multiple members share the same full path, only one is removed when
a path is provided.
Member:

I think it would be good to add that which one gets removed is unspecified and should not be relied on (to defensively discourage misuse).

Author:

Actually, the current implementation is deterministic: the entry mapped by ZipFile.getinfo(name) gets removed, and the last one in the filelist with the same name (if any) becomes the newly mapped one.

This is similar to how ZipFile.getinfo(name) resolves when there are multiple zinfos with the same name: the current implementation always maps to the last one in the filelist, though this is also undocumented.

The question is whether we should document the definite behavior or state that it's actually undefined and the current behavior should not be relied on. Until that question is resolved I would just keep the current statement.
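
A sketch of the mapping behavior described above (duplicate writes emit a UserWarning but are permitted):

import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zh:
    zh.writestr('dup.txt', 'first')
    zh.writestr('dup.txt', 'second')  # duplicate name; UserWarning

with zipfile.ZipFile(buf, 'a') as zh:
    # getinfo() maps to the last zinfo with that name, so per the
    # current implementation remove('dup.txt') would remove that one.
    print(zh.getinfo('dup.txt') is zh.infolist()[-1])  # True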

Member:

I think I would lean towards saying the behavior is undefined in this method. I would want some discussion about documenting the behavior with multiple zinfos.

Member:

I haven't gone back and looked, but I have a vague recollection that multiple zip entries with the same name is/was a "normal" legacy-ish zip "feature": it was how replacing the contents of a single member was implemented when the entire archive could not be rewritten, since it could be done solely by rewriting the end of the file and the central directory.

Author:

@gpshead This is the convention of many ZIP tools, including Python, at least currently. Unfortunately it's not clearly documented in the ZIP spec.

Comment on lines 544 to 554
Rewrites the archive to remove stale local file entries, shrinking its file
size.

If *removed* is provided, it must be a sequence of :class:`ZipInfo` objects
representing removed entries; only their corresponding local file entries
will be removed.

If *removed* is not provided, the archive is scanned to identify and remove
local file entries that are no longer referenced in the central directory.
The algorithm assumes that local file entries (and the central directory,
which is mostly treated as the "last entry") are stored consecutively:
Member:

I think it would be good to define what you mean by a stale local file entry here. It is somewhat explained in the third paragraph, but I would suggest adding something earlier on about what a stale file entry is.

Author:

I think "stale local file entries" is the general abstract concept of what this method removes (and should remove).

From the context, readers should be able to infer that "stale local file entries" means "local file entries referenced by the provided removed ZipInfo objects" when removed is provided, and "local file entries that are no longer referenced in the central directory (and meeting the 3 criteria)" when removed is not provided, with potentially another definition if more algorithms/modes are added in the future.

I could be more explicit and say that "stale local file entries" is defined as above, but that would probably be too redundant.

Member:

From the context, readers should be able to infer that "stale local file entries" means "local file entries referenced by the provided removed ZipInfo objects" when removed is provided, and "local file entries that are no longer referenced in the central directory (and meeting the 3 criteria)" when removed is not provided, with potentially another definition if more algorithms/modes are added in the future.

Readers should not need to read multiple paragraphs to infer the meaning of a phrase in the first sentence of documentation when it can be briefly defined earlier on instead.

I think defining what "stale" means in "stale local file entry" would be sufficient, as that is not a term coming from appnote.txt and is only introduced in these docs.

@danny0838 (Author) Jun 27, 2025:

As mentioned, the definition of "stale" is difficult and almost as complex/long as paragraphs 2-4. Even if we provide the "definition" first, the reader still needs to read the equally long sentences to understand it.

Reply:

We can simply replace "stale" with "a local file entry that doesn't exist in the central directory" or something similar.

@danny0838 (Author) Jun 30, 2025:

What about simply unreferenced local file entries?

Reply:

ambiguous imo

Comment on lines +556 to +558
#. Data before the first referenced entry is removed only when it appears to
be a sequence of consecutive entries with no extra following bytes; extra
preceding bytes are preserved.
Member:

This seems hard to work around if this isn't the behavior a user wants, i.e. if the bytes PK\003\004 appear before the first entry in some other format, then a user cannot use repack, according to my reading of this (I will revisit this once I read the implementation). Perhaps it would be better to take a start_offset (defaulting to ZipFile.data_offset or 0, maybe) that is the offset at which to start the search?

Member:

I think documenting the list of limitations of repack's scan would be an acceptable alternative to adding this.

@danny0838 (Author) Jun 27, 2025:

Can you be more specific about the "some other format" you are mentioning?

If it's something like:

[PK\003\004 and noise]
[unreferenced local file entry]
[first local file entry]

then the unreferenced local file entry will be removed (since it's "a sequence of consecutive entries with extra preceding bytes").

If it's something like:

[PK\003\004 and noise]
[first local file entry]

or

[unreferenced local file entry]
[PK\003\004 and noise]
[first local file entry]

then all bytes before the first local file entry will be preserved.

I think this is the reasonable behavior.

Member:

There are many formats that build on top of zip, such as doc, java class files, etc. These have prefixes, and you need to be sure that what you detect is an unreferenced file entry vs. local file entry magic plus noise. As far as I am aware, there's no way to be certain that what you have is an actual file entry. So there are potentially cases where the following layout could be a misinterpretation of a prefix to a zip file, and repack would incorrectly remove data from that prefix.

[unreferenced local file entry]
[PK\003\004 and noise]
[first local file entry]

@danny0838 (Author) Jun 27, 2025:

Yes, there is no 100% definite way to determine that a sequence of bytes is a stale local file entry; that's why I call it a heuristic. But I think the current criteria are accurate enough for most real-world cases, and a false removal is very, very unlikely to happen.

If you don't agree with me, can you provide a good example or test case where normal bytes would be misinterpreted and falsely removed as local file entries before the first referenced local file entry? Without one, it's kind of like dry talk, which is not constructive.

Member:

Given that heuristics and data scanning are involved, I think it is worthwhile to add a .. note:: to the repack docs here stating that it (1) cannot be guaranteed safe to use on untrusted zip file inputs, and (2) does not guarantee that zip-like archives, such as those with executable data prepended, will survive unharmed.

Realistically I think people with (2) style formats should know this but it is good to set expectations.

I'm asking for (1) preemptively because we're basically guaranteed to get security "vulnerability" reports about all possible API behaviors today [I'm on the security team - we really do], so pre-documenting the thing makes replying to those easier. 😺
We are never going to be able to prevent all such DoS, zipbomb, frankenzip, or multiple-interpretations-by-different-tooling reports, given that the zip format definition, combined with the collective behavior of the world's implementations, is so... fuzzy.

can you provide a good example or test case where normal bytes would be misinterpreted and falsely removed as local file entries before the first referenced local file entry?

That's exactly the kind of thing someone would do in a future security@ report. :P We can stay ahead of that by at least not offering a guarantee of safety for this API. The first innocent real-world starting points that come to mind are zip files stored embedded within zip files, and file formats containing multiple zip files where only one appears at the end of the file and thus acts like a zip to normal zip tooling (such as zipfile seeking out the end-of-file central directory). Those might not trigger a match in your logic as is, but if you work backwards from the logic, there may be ways to construct some that could. I'm just suggesting we don't try to overthink our robustness and make guarantees here.

@danny0838 (Author) Jun 30, 2025:

For (1): not only ZipFile.repack but the whole zipfile package is not guaranteed to be safe on an untrusted ZIP file. If a note or warning is required, it should probably be for the whole package rather than for this method only.

For (2): this is what repack wants to at least guarantee. The current algorithm should be quite safe against a false positive, as it's very unlikely that random binary bytes would happen to form a local file header magic and happen to have their "entry length" end exactly at the position of the first referenced local file header (or the next "local file entry").

This could even be improved by providing an option to check the CRC when validating every "local file entry"-like sequence, though that would impact performance significantly and I don't think it's worth it.

A prepended zip should be safe, since it has a central directory, which will be identified as "extra following bytes" and skipped from stripping - unless the zip is abnormal, e.g. its last local file entry overlaps the central directory comment and thus has no additional bytes after its end.

A zip embedded as the content of a member is also 100% safe, since the algorithm won't strip anything inside a local file entry.

A zip embedded immediately after a local file entry will be falsely stripped, but that case is explicitly precluded by the documented presumption that "local file entries are stored consecutively", and should be unlikely to happen in a normal zip-like file.

Given that the current documentation already explains its assumption and algorithm, I expect the developer to be able to estimate the risk on their own. Although it's not 100% safe, worrying about this may be like worrying about a repository breaking due to a SHA-1 collision when using Git. I agree that it would be good to set a fair expectation for the heuristics-based algorithm and to encourage providing removed for better performance and accuracy, but I also don't want to give the impression that the algorithm is fragile and could easily blow up on random input. Don't you think it would be overkill for Git to show a big warning saying that it doesn't guarantee your data won't break accidentally?

Anyway, I'm open to this. It's welcome if someone can provide a good rephrasing or note without such issues.

- Modifies the ZIP file in place.
- Updates zfile.start_dir to account for removed data.
- Sets zfile._didModify to True.
- Updates header_offset and clears _end_offset of referenced
Member:

It would probably be good to also update data_offset if anything before the existing data_offset is removed.

Author:

As #134627 (comment) stated, I'm not quite clear what data_offset is designed for and what it should be updated to.

In the current implementation data_offset will be 0 when the ZipFile is reloaded after any modification in 'a' mode. Should we reset it to 0 during a repack?

Member:

data_offset is the offset to the beginning of the zip content in a file, since there may be other content before the beginning of the first local entry. I think it might actually be better if it were updated on .remove(), since that is what changes the first file entry in the CD record.

@danny0838 (Author) Jun 27, 2025:

data_offset is the offset to the beginning of the zip content in a file, since there may be other content before the beginning of the first local entry.

As I've pointed out in that comment, data_offset can still be 0 even if there is other content before the beginning of the first local entry, making it a bad indicator.

Without making clear what data_offset ought to do and possibly whether its behavior requires a redesign/change first, we cannot decide how to update its value here.

Author:

Quickly implemented updating of data_offset in 725b1a3.

However, the previously mentioned issue with data_offset still exists, and a review and revision of the data_offset implementation may be needed.

Comment on lines +1501 to +1509
# calculate the starting entry offset (bytes to skip)
if removed is None:
try:
offset = filelist[0].header_offset
except IndexError:
offset = zfile.start_dir
entry_offset = self._calc_initial_entry_offset(fp, offset)
else:
entry_offset = 0
Member:

I think you can use zfile.data_offset if it isn't None and fall back to this code?

@danny0838 (Author) Jun 27, 2025:

No, it doesn't work.

ZipFile.data_offset is always 0 if the offsets in the central directory are calculated from the start of the file (i.e. bytes before the first entry are included in the calculation), even if filelist[0].header_offset is not 0.

Minimal reproducible example:

import zipfile

with open('archive.zip', 'wb') as fh:
    fh.write(b'dummy')
    with zipfile.ZipFile(fh, 'a') as zh:
        zh.writestr('file', 'file content')

with zipfile.ZipFile('archive.zip') as zh:
    print(zh.data_offset)  # 0
    print(zh.infolist()[0].header_offset)  # 5
    print(zh.start_dir)  # 51

And also see tests test_repack_bytes_before_first_file and test_repack_prepended_bytes.

This seems somewhat inconsistent with the documentation of ZipFile.data_offset ("The offset to the start of ZIP data from the beginning of the file"), but it may depend on how the start of ZIP data is interpreted. In some sense it looks like a bug, but I'm not sure, as the original issue introducing it (#84481) does not clearly mention what intended use case the attribute is for.

#. Entries must not overlap. If any entry's data overlaps with another, a
:exc:`BadZipFile` error is raised and no changes are made.

When scanning, setting ``strict_descriptor=True`` disables detection of any
Member:

Why not default to strict_descriptor=True given that it performs better and the zip files we expect people to be manipulating in remove/repack manners are presumed most likely to be "modern" forms?

Author:

This is exactly one of the open questions (#134627 (comment), #134627 (comment)).

The current quick decision is primarily because it adheres better to the spec, and most of the Python stdlib tends to prioritize compatibility over performance, e.g. json.dump with ensure_ascii=True and http.server with HTTP version 1.0. But it's not solid and can be changed, based on a vote or something?
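
For reference, the opt-in form under discussion, as a sketch (the default value is the open question above):

import zipfile

with zipfile.ZipFile('archive.zip', 'a') as zh:
    zh.remove('a.txt')
    # Skip the slow scan for legacy unsigned data descriptors;
    # assumes the archive contains none.
    zh.repack(strict_descriptor=True)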

