Make WordPress Core

Opened 3 months ago

Closed 3 weeks ago

#64394 closed defect (bug) (fixed)

HTML Processor may error processing nested HTML structures

Reported by: jonsurrell's profile jonsurrell Owned by: jonsurrell's profile jonsurrell
Milestone: 7.0 Priority: normal
Severity: normal Version: 6.4
Component: HTML API Keywords: has-patch has-unit-tests
Focuses: Cc:

Description

The HTML Processor (not the Tag Processor) may throw an error when processing HTML with very deep (>100 element) nesting:

<i><!-- …99 <i> tags… --><i>

It will throw Exception: could not allocate bookmark.

Change History (26)

#1 @dmsnell
3 months ago

Is this specifically about nesting depth? Or it could occur also for documents with certain specific kinds of tags. I think a reasonable option in any case is to catch and abort in these cases, but we might want to examine the throwing behavior of the underlying Tag Processor 🤷‍♂️

There are discussions about capping nesting depth in the spec itself, so we should try to align over time with that. According to this discussion, there are already limits in Safari, Firefox, and Chrome. We could potentially adjust our arbitrary limit to match those.

Chrome&Firefox the max depth is 513, while in Safari it is 512

Also interesting from that issue is this note

at some point it stops nesting elements, and instead adds them at sibling. In all browser, there is some MAX_DEPTH, and if the stack of open elements has already a length of MAX_DEPTH then before pushing a new element to the stack they'll pop the previous one

Of course this isn’t specified behavior, but it certainly establishes a baseline for expectations we could follow.

#2 @jonsurrell
3 months ago

  • Version set to 6.4

#3 @jonsurrell
3 months ago

Nesting is the most obvious way for the error to happen. Internally, each element on the stack of open elements gets a bookmark. When nesting causes the stack to have more bookmarks than are allowed, this error is triggered.

Internal and user bookmarks are shared, so if many bookmarks are being set externally it would lower the nesting threshold for triggering this error.

There are also bookmarks associated with virtual tokens, but that's a form of nesting. For example <table><td> is apparently 2 tags, but when processed via ::create_fragment() this actually consumes a maximum of 6 bookmarks I believe. 2 bookmarks for the root and context (HTML > BODY) then 4 more elements (TABLE > TBODY (virtual) > TR (virtual) > TD).

This ticket was mentioned in PR #10616 on WordPress/wordpress-develop by @jonsurrell.


3 months ago
#4

  • Keywords has-patch has-unit-tests added

Bookmark exhaustion, typically from deep nesting, can cause the HTML Processor to throw an Exception.

The Exception is thrown by a private method, handle the exception and return false to indicate a failure to process.

Trac ticket: https://core.trac.wordpress.org/ticket/64394

#5 @dmsnell
3 months ago

For example <table><td> is apparently 2 tags, but when processed via ::create_fragment() this actually consumes a maximum of 6 bookmarks

This would correspond with the actual DOM nesting though, so I think that’s fine.

Even if we leave some buffer, for example, setting a MAX_BOOKMARKS count of something like 600 or 1000, we will then have plenty of opportunity to parse all documents a browser would parse.

The bookmark limit was always meant to be tunable and mostly serve to prevent infinite loops and needlessly wasteful code. There is no intrinsic constraint they are guarding.

#6 @jonsurrell
3 months ago

I agree, it seems like MAX_BOOKMARKS can be increased. However, I'd keep that work separate from this ticket.

#7 @jonsurrell
2 months ago

  • Milestone changed from Awaiting Review to 7.0

This ticket was mentioned in PR #10820 on WordPress/wordpress-develop by @jonsurrell.


7 weeks ago
#8

Bookmark exhaustion has been observed on fairly standard WordPress post content. This is partially related to WordPress/gutenberg#70542 which may nest formatting elements to extreme depths.

The initial limit of 100 bookmarks was very conservative. A significant increase in bookmarks seems to perform well without excessive memory usage.

Related to #10616.

@jonsurrell commented on PR #10616:


7 weeks ago
#9

I definitely don’t love how pervasive and deep our error-checking has become here.

Me neither 😕

did you attempt to use ->bail()?

I did try using bail(). That transforms a bookmark exhaustion error into an unsupported error, which seemed confusing. When bookmarks are exhausted, an error is set. Bail overrides with an unsupported error code.

We could consider having bail leave the last error in case it's already been set.

what if we continue to throw inside bookmark_token() but trap that and handle aborting the same way we handle all deep control flow issues?

I have that in an earlier commit. It wasn't as pervasive (at b02d6793b86f8c982401cb367ce5cd1d90ce5831):

https://github.com/WordPress/wordpress-develop/blob/b02d6793b86f8c982401cb367ce5cd1d90ce5831/src/wp-includes/html-api/class-wp-html-processor.php#L1043-L1050
https://github.com/WordPress/wordpress-develop/blob/b02d6793b86f8c982401cb367ce5cd1d90ce5831/src/wp-includes/html-api/class-wp-html-processor.php#L1157-L1169

@dmsnell commented on PR #10616:


7 weeks ago
#10

I have that in an earlier commit. It wasn't as pervasive

Why did you change it @sirreal? did I ask you to? 🙃 🤦‍♂️

@jonsurrell commented on PR #10616:


7 weeks ago
#11

No, I was just iterating and exploring. The method actually already had a …|false return type annotation, suggesting it may have intended to return false on failure.

I can't say I like the general Exception catching either, but it works. I don't have a strong preference, would you prefer that other version @dmsnell?

@dmsnell commented on PR #10820:


6 weeks ago
#12

What about increasing \WP_HTML_Tag_Processor::MAX_BOOKMARKS as well?

The issue is more relevant on the HTML Processor because it creates its own bookmarks for each token as it processes them. The Tag Processor has no reason to do so, so a small limit is a reasonable guard for developers who create bookmarks in a loop, e.g. $processor->set_bookmark( "item-{$i}" );. If they need more they can subclass.

We want to make sure we can properly parse the HTML documents in the HTML Processor, which requires more internal bookmarks. @sirreal did explore bypassing this issue entirely by creating what is effectively a separate namespace for internal bookmarks. Given that we chose this value somewhat arbitrarily and saw in practice that it was simply too low, I think it’s pragmatic to update only here.

@dmsnell commented on PR #10616:


6 weeks ago
#13

@sirreal is there a copy of the exception-catching version available?

what do you not like about it, or prefer about adding |false to all of the methods?

@jonsurrell commented on PR #10616:


6 weeks ago
#14

is there a copy of the exception-catching version available?

@dmsnell This diff should give an idea

what do you not like about it, or prefer about adding |false to all of the methods?

I don't feel strongly one way or the other.

Relying on error handling does reduce the checks required. My biggest concern is that it depends on methods being called inside of try/catch structures, which is easy to overlook. Handling a false return also requires some discipline, but at least the check is local and doesn't require understanding a call stack.

Some background to consider (all considering trunk):

---

The main thing I disliked was the general Exception catching that was required. It's possible for the last_error to be set and another Exception to arise:

https://github.com/WordPress/wordpress-develop/blob/b02d6793b86f8c982401cb367ce5cd1d90ce5831/src/wp-includes/html-api/class-wp-html-processor.php#L1043-L1050
https://github.com/WordPress/wordpress-develop/blob/b02d6793b86f8c982401cb367ce5cd1d90ce5831/src/wp-includes/html-api/class-wp-html-processor.php#L1157-L1169

Perhaps that check should have compared the known message:
https://github.com/WordPress/wordpress-develop/blob/b02d6793b86f8c982401cb367ce5cd1d90ce5831/src/wp-includes/html-api/class-wp-html-processor.php#L5125

I also considered introducing another HTML API Exception class, but I'm reluctant to introduce another exception class for this.

@jonsurrell commented on PR #10820:


6 weeks ago
#15

@sirreal did explore bypassing this issue entirely by creating what is effectively a separate namespace for internal bookmarks

One clarification, I don't believe that was in this codebase but in the Rust port.

I do like the idea of keeping internal and external bookmarks separate. Internal bookmarks could then also use numeric arrays and likely most bookmark operations would be push/pop operations on a stack with potential performance implications. It's a rainy day idea 🙂

@dmsnell commented on PR #10820:


6 weeks ago
#16

:shipit: getting this in will practically eliminate the need for #10616. while it’s good to continue on that one, it will be better to have some time to think about it without worrying that stuff is crashing

@dmsnell commented on PR #10616:


4 weeks ago
#17

The exception-throwing and catching version definitely suits me more, notably because it avoids exporting these details to the callers.

I also think this is a middle-path to excluding internal bookmarks, so spreading out the protection throughout the class seems like it’s going to be code we then have to find and remove later (though maybe the recent PHPStan changes will help surface those cases).

which is easy to overlook

If our goal is to never fail _under normal usage_ and the failed bookmarks are only there to prevent infinite loops or memory-abuse, then perhaps we add a test to ensure that setting the bookmark fails and also that it doesn’t throw.

#18 @desrosj
3 weeks ago

In 61726:

Build/Test Tools: Exclude gutenberg directory when running PHPCS.

The relevant files will be scanned within src (provided the build script has been run).

See #64394.

#19 @desrosj
3 weeks ago

Apologies, [61726] was intended for #64393.

@jonsurrell commented on PR #10616:


3 weeks ago
#20

The exception-throwing and catching version definitely suits me more, notably because it avoids exporting these details to the callers.

I've restored the throw/catch version of these change.

perhaps we add a test to ensure that setting the bookmark fails and also that it doesn’t throw.

The PR includes tests like that. They test the processing behavior without testing the bookmark functions directly.

@jonsurrell commented on PR #10616:


3 weeks ago
#21

@dmsnell Thanks, I've reworked some of the testing based on your feedback.

I do plan to land https://github.com/WordPress/wordpress-develop/pull/10820 to increase the max bookmarks as well.

#22 @jonsurrell
3 weeks ago

  • Owner set to jonsurrell
  • Resolution set to fixed
  • Status changed from assigned to closed

In 61755:

HTML API: Prevent bookmark exhaustion from throwing.

When bookmark exhaustion occurs during processing, return false instead of throwing an Exception.

Developed in https://github.com/WordPress/wordpress-develop/pull/10616.

Props jonsurrell, westonruter, dmsnell.
Fixes #64394.

#23 @jonsurrell
3 weeks ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

#24 @jonsurrell
3 weeks ago

In 61756:

HTML API: Increase HTML Processor bookmark limit to 10,000.

The limit was overly conservative and bookmark exhaustion may occur on realistic documents. The increased limit allows documents with much greater depth to be safely processed.

The limit is increased from 100 to 10,000.

Developed in https://github.com/WordPress/wordpress-develop/pull/10820.

Follow-up to [61755].

Props jonsurrell, dmsnell, westonruter.
See #64394.

@jonsurrell commented on PR #10820:


3 weeks ago
#25

Merged in r61756.

#26 @jonsurrell
3 weeks ago

  • Resolution set to fixed
  • Status changed from reopened to closed
Note: See TracTickets for help on using tickets.