gh-135661: Fix parsing start and end tags in HTMLParser #135930

serhiy-storchaka · 2025-06-25T11:46:03Z

* Whitespace is no longer accepted between `</` and the tag name. E.g. `</ script>` does not end the script section.
* Vertical tabulation (`\v`) and non-ASCII whitespace characters are no longer recognized as whitespace. The only whitespace characters are `\t\n\r\f `.
* The null character (U+0000) no longer ends the tag name.
* An end tag can have attributes and slashes after the tag name. It no longer ends at the first `>` inside a quoted attribute value. E.g. `</script/foo=">"/>`.
* Multiple slashes and whitespace characters between the last attribute and the closing `>` are now accepted in both start and end tags. E.g. `<a foo=bar/ //>`.
* Multiple `=` between the attribute name and value are no longer collapsed. E.g. `<a foo==bar>` produces attribute "foo" with value "=bar".
* Whitespace between the `=` separator and the attribute name or value is no longer ignored. E.g. `<a foo =bar>` produces two attributes "foo" and "=bar", both with value None; `<a foo= bar>` produces two attributes: "foo" with value "" and "bar" with value None.
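As a rough way to observe the cases listed above, the sketch below records the events that `html.parser.HTMLParser` reports for a few of the example inputs. The `EventCollector` class and the `parse` helper are names made up for this illustration. The printed results depend on whether the running Python already includes this change, so no expected output is shown.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Record the start-tag, end-tag and data events seen while parsing."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def parse(markup):
    """Feed one complete document and return the recorded events."""
    collector = EventCollector()
    collector.feed(markup)
    collector.close()
    return collector.events


# Inputs taken from the description; results vary with the Python version.
for sample in ('<a foo==bar>', '<a foo =bar>', '<a foo= bar>',
               '<a foo=bar/ //>', '<script>x</ script>y</script>'):
    print(repr(sample), '->', parse(sample))
```

Comparing the output on an interpreter with and without the fix is probably the quickest way to see which of the listed behaviors affect a given code base.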

Issue: HTMLParser differences from the HTML5 specification #135661


serhiy-storchaka · 2025-06-25T12:44:22Z

I tried to minimize the changes and to split this PR into several PRs, but they would not be independent, and all of these changes are needed to fix the possible XSS.

I am planning further refactoring, but this is only for the main branch.

ezio-melotti · 2025-07-02T14:36:08Z

Lib/html/parser.py

@@ -36,29 +36,33 @@
 #     explode, so don't do it.


I don't know if you saw and heeded the warning or if you just got lucky, but it looks like you were able to change these regexes!
Since you renamed `locatestarttagend`, the comment at line 34 should also be updated.

In addition, make sure that existing comments are still relevant. In particular I would appreciate this for comments linking to specific sections of the HTML5 standard.


ezio-melotti · 2025-07-02T14:52:17Z

Lib/html/parser.py

+     )?
+    [\t\n\r\f /]*                   # possibly followed by a space
+   )*
+   >?


These changes make sense to me.

I also noticed that you removed the start from locatestarttagend_tolerant, presumably because you are now using it to find the end of end tags too (which can contain attributes, even if they are invalid).

This variable is not documented; however, I can see two options:

* we consider it private and just rename it;
* we create an alias to the old name for backward compatibility, in case someone was using it.

Note that there used to be a set of `*_strict` variables that were removed, so the `_tolerant` suffix is no longer needed and was kept for backward compatibility. Since you are refactoring/renaming (some of) these variables, you might want to consider dropping the `_tolerant` suffix altogether (and possibly adding aliases to preserve backward compatibility), either in this or in a separate PR.


ezio-melotti · 2025-07-02T14:54:18Z

Lib/html/parser.py

@@ -141,7 +145,8 @@ def get_starttag_text(self):

    def set_cdata_mode(self, elem):
        self.cdata_elem = elem.lower()
-        self.interesting = re.compile(r'</\s*%s\s*>' % self.cdata_elem, re.I)
+        self.interesting = re.compile(r'</%s(?=[\t\n\r\f />])' % self.cdata_elem,
+                                      re.IGNORECASE|re.ASCII)


Any reason for adding re.ASCII here?


ezio-melotti · 2025-07-02T15:01:43Z

Lib/html/parser.py


    # Internal -- parse endtag, return end or -1 if incomplete
    def parse_endtag(self, i):
        rawdata = self.rawdata
        assert rawdata[i:i+2] == "</", "unexpected call to parse_endtag"
-        match = endendtag.search(rawdata, i+1) # >
-        if not match:
+        if rawdata.find('>', i+2) < 0:


Suggested change

-        if rawdata.find('>', i+2) < 0:
+        if rawdata.rfind('>', i+2) < 0:

Probably inconsequential performance-wise, but using rfind seems more logical here (and possibly elsewhere).



serhiy-storchaka

Thank you for the review, @ezio-melotti.

serhiy-storchaka · 2025-07-02T17:43:12Z

Lib/html/parser.py

@@ -36,29 +36,33 @@
 #     explode, so don't do it.


There are links below; they still work, although they now redirect to another address. I updated them.

On the other hand, the section numbers have changed. I updated them in the places I touched.

serhiy-storchaka · 2025-07-02T17:45:55Z

Lib/html/parser.py

+     )?
+    [\t\n\r\f /]*                   # possibly followed by a space
+   )*
+   >?


I restored the removed variables. I will remove them in the main branch in a follow-up PR.

serhiy-storchaka · 2025-07-02T17:51:01Z

Lib/html/parser.py

@@ -141,7 +145,8 @@ def get_starttag_text(self):

    def set_cdata_mode(self, elem):
        self.cdata_elem = elem.lower()
-        self.interesting = re.compile(r'</\s*%s\s*>' % self.cdata_elem, re.I)
+        self.interesting = re.compile(r'</%s(?=[\t\n\r\f />])' % self.cdata_elem,
+                                      re.IGNORECASE|re.ASCII)


Yes, it affects case-insensitive matching. Without it, 'ſ' would match 's' and 'ı' would match 'i'. There may be more such cases after support for title and textarea is added.

This is not actually a problem in the current code, but future changes could make it important.
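To make the effect of `re.ASCII` concrete, here is a small sketch that compiles the pattern from the hunk above (with `script` substituted for `%s`) with and without the flag. The names `unicode_ci`, `ascii_ci` and `candidates` are only for this illustration and do not appear in `Lib/html/parser.py`.

```python
import re

# Pattern text copied from the diff hunk, with "script" substituted for %s.
PATTERN = r'</script(?=[\t\n\r\f />])'

unicode_ci = re.compile(PATTERN, re.IGNORECASE)
ascii_ci = re.compile(PATTERN, re.IGNORECASE | re.ASCII)

candidates = {
    'plain ASCII': '</SCRIPT>',
    'long s (U+017F)': '</\u017fcript>',
    'dotless i (U+0131)': '</scr\u0131pt>',
}

# Show which candidates each compiled pattern accepts.
for label, text in candidates.items():
    print(f'{label:20} unicode={bool(unicode_ci.match(text))} '
          f'ascii={bool(ascii_ci.match(text))}')
```

On CPython, Unicode case-insensitive matching treats 'ſ' and 'ı' as equivalents of 's' and 'i', so the two non-ASCII candidates should match only the pattern compiled without `re.ASCII`.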

serhiy-storchaka · 2025-07-02T17:59:14Z

Lib/html/parser.py


    # Internal -- parse endtag, return end or -1 if incomplete
    def parse_endtag(self, i):
        rawdata = self.rawdata
        assert rawdata[i:i+2] == "</", "unexpected call to parse_endtag"
-        match = endendtag.search(rawdata, i+1) # >
-        if not match:
+        if rawdata.find('>', i+2) < 0:


This check is not actually needed. It is simply an optimization for the case of a truncated end tag, because it is faster than `endtagopen.match()` + `locatetagend.match()`. I do not know whether it really helps, but I left it as insurance against unexpected performance degradation.

`find` may be faster than `rfind` in general, and in the case of an end tag, there is a good chance of finding ">" within the first few characters.
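For the purpose of this check, `find` and `rfind` answer the same yes/no question (is there any ">" at or after `i+2`?) and differ only in the direction of the scan. A minimal sketch, using two hypothetical helper functions that are not part of `html.parser`:

```python
def has_gt_find(rawdata, i):
    """True if a '>' occurs at or after i+2, scanning left to right."""
    return rawdata.find('>', i + 2) >= 0


def has_gt_rfind(rawdata, i):
    """Same question, but scanning from the end of the buffer backwards."""
    return rawdata.rfind('>', i + 2) >= 0


samples = [
    '</script>',                      # complete end tag
    '</script',                       # truncated end tag
    '</a>' + 'x' * 50,                # '>' close to the start of a long buffer
    '</',                             # nothing after the opener
]

# Both helpers must agree on every sample.
for text in samples:
    assert has_gt_find(text, 0) == has_gt_rfind(text, 0)
    print(repr(text[:12]), has_gt_find(text, 0))
```

Since the results agree, the choice between the two is about scan cost only: a left-to-right scan can stop at the ">" that closes the tag, while a right-to-left scan may first walk over whatever follows it in the buffer.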


ezio-melotti · 2025-07-02T21:39:16Z

Lib/html/parser.py

@@ -36,29 +36,33 @@
 #     explode, so don't do it.


serhiy-storchaka requested a review from ezio-melotti as a code owner June 25, 2025 11:46

serhiy-storchaka added needs backport to 3.13 bugs and security fixes needs backport to 3.14 bugs and security fixes labels Jun 25, 2025

bedevere-app bot added the awaiting core review label Jun 25, 2025

bedevere-app bot mentioned this pull request Jun 25, 2025

HTMLParser differences from the HTML5 specification #135661

Open

Fix Sphinx errors.

182b16f

ezio-melotti reviewed Jul 2, 2025


serhiy-storchaka and others added 4 commits July 2, 2025 20:17

Merge branch 'main' into htmlparser-tag

436a8a9

Apply suggestions from code review

ebf8ce3

Co-authored-by: Ezio Melotti <[email protected]>

Merge remote-tracking branch 'refs/remotes/origin/htmlparser-tag' into htmlparser-tag

d05303b

Address review comments.

955db4e

serhiy-storchaka commented Jul 2, 2025


ezio-melotti approved these changes Jul 2, 2025



ezio-melotti Jul 2, 2025


Thanks!

bedevere-app bot added awaiting merge and removed awaiting core review labels Jul 2, 2025
