Make WordPress Core

Opened 6 months ago

Closed 3 months ago

Last modified 3 months ago

#63738 closed enhancement (fixed)

HTML API: Internal updates in 6.9

Reported by: dmsnell's profile dmsnell Owned by: dmsnell's profile dmsnell
Milestone: 6.9 Priority: normal
Severity: normal Version: 6.9
Component: HTML API Keywords: has-patch has-unit-tests
Focuses: Cc:

Description

Tracking ticket for updates to the HTML API itself in 6.9.

  • Interface enhancements
  • Performance improvements
  • Spec-compatability improvements
  • Other internal details.

Change History (36)

This ticket was mentioned in PR #9231 on WordPress/wordpress-develop by @jonsurrell.


6 months ago
#1

  • Keywords has-patch has-unit-tests added

Trac ticket: Core-63738

Simplify META tag handling with regards to encoding confidence.

With any of the standard HTML Processor creation paths (WP_HTML_Processor::create_fragment() and WP_HTML_Processor::create_full_parser()), the encoding is never set to tentative. It is either irrelevant (on fragments) or certain on full parsers. In both cases, only the UTF-8 encoding is supported.

This is an optimization to lift a necessary condition to the top level and avoid processing in the case we expect to encounter in almost every case.

---

Fragments follow this path to set irrelevant:
https://github.com/WordPress/wordpress-develop/blob/1c8185e12dd9fb1606eb22118f09aef46e6737d0/src/wp-includes/html-api/class-wp-html-processor.php#L319
https://github.com/WordPress/wordpress-develop/blob/1c8185e12dd9fb1606eb22118f09aef46e6737d0/src/wp-includes/html-api/class-wp-html-processor.php#L547

And full parsers set certain here:
https://github.com/WordPress/wordpress-develop/blob/1c8185e12dd9fb1606eb22118f09aef46e6737d0/src/wp-includes/html-api/class-wp-html-processor.php#L345

Trac ticket:

This ticket was mentioned in PR #9301 on WordPress/wordpress-develop by @jonsurrell.


6 months ago
#2

Trac ticket: Core-63738

This ticket was mentioned in PR #9230 on WordPress/wordpress-develop by @jonsurrell.


6 months ago
#3

Trac ticket: Core-63738

Reduce the number of length checks in WP_HTML_Tag_Processor::skip_script_data().

#5 @jonsurrell
6 months ago

In 60501:

HTML API: Mark doctype info class as private.

Developed in https://github.com/WordPress/wordpress-develop/pull/9301.

Props jonsurrell, dmsnell.
See #63738.

@jonsurrell commented on PR #9301:


6 months ago
#6

Merged in [60501].

#7 @jonsurrell
6 months ago

In 60502:

HTML API: Simplify META tag processing.

META tag processing can be simplified in most cases. Add an early check and return to avoid additional processing.

Developed in https://github.com/WordPress/wordpress-develop/pull/9231.

Props jonsurrell, dmsnell.
See #63738.

@jonsurrell commented on PR #9231:


6 months ago
#8

Merged in [60502].

@mukesh27 commented on PR #9318:


6 months ago
#10

Thanks for quick PR 🚀

#11 @jonsurrell
6 months ago

In 60503:

HTML API: Make META tag test data provider static.

Developed in https://github.com/WordPress/wordpress-develop/pull/9318.

Follow-up to [60502].

Props jonsurrell, mukesh27.
See #63738.

@jonsurrell commented on PR #9318:


6 months ago
#12

Merged in [60503].

@jonsurrell commented on PR #9230:


6 months ago
#13

I've updated the comment and added a number of tests for different script data states. @dmsnell I'd appreciate if you could take another quick look.

This ticket was mentioned in PR #9397 on WordPress/wordpress-develop by @jonsurrell.


6 months ago
#14

This includes https://github.com/WordPress/wordpress-develop/pull/9230 (merged here).

Trac ticket: https://core.trac.wordpress.org/ticket/63738

Address two small HTML API mis-parses of script contents.

---

<script>
`. This abruptly closed empty comment does not transition to _escaped_, but remains in the _unescaped_ state.

In the example, the _double-escaped_ state should not be reached on `<script>` (because the processor should not be in the _escaped_ state) and the script tag should close correctly at `</script>`.

[https://playground.wordpress.net/?plugin=html-api-debugger&url=%2Fwp-admin%2Fadmin.php%3Fpage%3Dhtml-api-debugger%26html%3D%253Cscript%253E%250A%253C%2521--%253E%250A%253Cscript%253E%250A%253C%252Fscript%253E%25F0%259F%258E%2589&wp=6.8.2 before] / [https://playground.wordpress.net/?plugin=html-api-debugger&url=%2Fwp-admin%2Fadmin.php%3Fpage%3Dhtml-api-debugger%26html%3D%253Cscript%253E%250A%253C%2521--%253E%250A%253Cscript%253E%250A%253C%252Fscript%253E%25F0%259F%258E%2589&core-pr=9397 after]

---

```html
<script>
<!--
<script</script>🎉

In this case <script< is correctly recognized as _not_ a sequence that should transition from _escaped_ to _double-escaped_, however it incorrectly advances beyond the following < character that starts the script close tag and does not close correctly at </script>.

before / after

@jonsurrell commented on PR #9230:


6 months ago
#15

I've pulled the updated comment and moved the test adjacent to the other script parsing test, things which were noted in https://github.com/WordPress/wordpress-develop/pull/9397.

#16 @jonsurrell
6 months ago

In 60617:

HTML API: Reduce length checks in skip_script_data.

Apply an optimization to remove several repeated string length checks in WP_HTML_Tag_Processor::skip_script_data().

Developed in https://github.com/WordPress/wordpress-develop/pull/9230.

Props jonsurrell, dmsnell.
See #63738.

@jonsurrell commented on PR #9230:


6 months ago
#17

Merged in [60617].

@jonsurrell commented on PR #9397:


6 months ago
#18

Good reminder on the coverage reports. I've added more tests and all the lines in skip_script_data are now hit with the exception of this condition, the body of this is never entered:

https://github.com/WordPress/wordpress-develop/blob/057f585b2e7698038ff44d299a2cfdb394318a61/src/wp-includes/html-api/class-wp-html-tag-processor.php#L1631-L1633

I think that it's impossible to hit now (after [60617] / https://github.com/WordPress/wordpress-develop/pull/9230). It may always have been impossible. However, I don't mind leaving that check even if it's redundant.

I believe the condition would be hit on inputs like <script></script that end right after the closing script tag name. With the early length check it shouldn't be possible to hit this condition because the function would already have returned false here:

https://github.com/WordPress/wordpress-develop/blob/057f585b2e7698038ff44d299a2cfdb394318a61/src/wp-includes/html-api/class-wp-html-tag-processor.php#L1531-L1533

This ticket was mentioned in PR #9402 on WordPress/wordpress-develop by @jonsurrell.


6 months ago
#19

Trac ticket: https://core.trac.wordpress.org/ticket/63738

Address a minor HTML API mis-parse of script contents, where \f is not recognized as a valid trailing/termination character of SCRIPT tag names.

---

In this case, the \f _form feed_ should be recognized as the end of the script tag name and close the script:

<script></script␌>🎉

before / after

---

This is another example of the same issue. Again, \f _form feed_ should be recognized as the end of the script tag name and enter the double-escaped state (this script tag is not closed and consumes the rest of the document):

<script><!--<script␌</script>🎉

In this case <script< is correctly recognized as _not_ a sequence that should transition from _escaped_ to _double-escaped_, however it incorrectly advances beyond the following < character that starts the script close tag and does not close correctly at </script>.

before / after

@dmsnell commented on PR #9397:


6 months ago
#20

Is it possible to have Copilot not pollute the PR with its comments? Can it not run locally, or at least leave comments as the result of some command you run? or is it only designed to run from the Github interface and leave erroneous comments?

@jonsurrell commented on PR #9397:


6 months ago
#21

Thanks for diving in and doing your own investigation, I'm happy to have my research confirmed ❤️

The blog post we're referencing is available here: Safe JSON in script tags: How not to break a site

On that note, the way I read your diagram, the </script transitions into Close SCRIPT tag are misleading, because the only exit out of the element is a complete closing tag whose name is SCRIPT. This means that if you encounter </script- in those places, we stay in script data or in script data escaped.

That's a good point, it was a bit ambiguous. I alluded to the trailing character in a foot note but it wasn't sufficiently clear. I added a note after the diagram to help clarify because this is an important point.

@dmsnell commented on PR #9397:


6 months ago
#22

It’s fine to keep these separate, but would we want to combine this with #9402? Maybe merge it into here?

This is mostly about accounting and not about code.

#25 @jonsurrell
5 months ago

In 60649:

HTML API: Improve script tag escape state processing.

Addresses some edge cases parsing of script tag contents:

  • "<!-->" remains in the unescaped state and does not enter the escaped state.
  • Contents in the escaped state that end with "<script" do not enter double-escaped state.
  • "\f" (Form Feed) was missing as a tag name terminating character.

Developed in https://github.com/WordPress/wordpress-develop/pull/9397 and https://github.com/WordPress/wordpress-develop/pull/9402.

Props jonsurrell, dmsnell.
See #63738.

@jonsurrell commented on PR #9397:


5 months ago
#26

Merged in [60649].

This ticket was mentioned in PR #9560 on WordPress/wordpress-develop by @jonsurrell.


5 months ago
#27

The WP_Tag_Processor::set_modifiable_text() method should treat <script the same way as it treats </script. Either of these sequences may affect the script elements close.

Trac ticket: https://core.trac.wordpress.org/ticket/63738

@jonsurrell commented on PR #9560:


5 months ago
#28

This could allow some more safe inputs with a regular expression based check. There may also be ways to escape the script tag contents.

Before exploring those improvements, I'd like to address this issue where some potentially dangerous script tag contents are allowed.

#29 @jonsurrell
5 months ago

In 60706:

HTML API: Prevent adding dangerous double-escape SCRIPT contents.

Prevent WP_Tag_Processor::set_modifiable_text() from allowing SCRIPT contents with "<script" like it does with "</script". Either of these sequences may affect the script element's close.

Developed in https://github.com/WordPress/wordpress-develop/pull/9560.

Props jonsurrell, westonruter, dmsnell.
See #63738.

@jonsurrell commented on PR #9560:


5 months ago
#30

Merged in [60706].

This ticket was mentioned in Slack in #core by welcher. View the logs.


4 months ago

#32 @westonruter
3 months ago

  • Owner set to dmsnell
  • Status changed from new to assigned

#33 @wildworks
3 months ago

@dmsnell The 6.9 Beta1 release is coming soon, can this ticket be closed now? Are there any remaining tasks that need to be addressed on this ticket?

#34 @dmsnell
3 months ago

@wildworks I’m still working on these; I don’t know what the practice is, but if we could wait to close them until the Beta is out that would be a big help to me (keep me from having to address the same comment or closure multiple times).

Either way, if you need to close them I won’t stand in your way, but I will continue to try and merge work from them before the Beta.

#35 @wildworks
3 months ago

  • Resolution set to fixed
  • Status changed from assigned to closed

As the Beta1 release begins, I will close this ticket. If there are any other issues that need to be addressed, please leave a comment.

This ticket was mentioned in Slack in #core by wildworks. View the logs.


3 months ago

Note: See TracTickets for help on using tickets.