Opened 5 months ago
Closed 2 months ago
#61545 closed enhancement (fixed)
HTML API: Performance Improvements for 6.7
Reported by: | dmsnell | Owned by: | dmsnell |
---|---|---|---|
Milestone: | 6.7 | Priority: | normal |
Severity: | normal | Version: | 6.6 |
Component: | HTML API | Keywords: | has-patch has-unit-tests |
Focuses: | performance | Cc: |
Description (last modified by )
The HTML API is already efficient, but it can (possibly) be better.
There are multiple ways to potentially improve the performance of the Tag Processor, HTML Processor, and HTML Decoder. This ticket is a tracking ticket for those experiments and changes
Experiments
Optimize low-level details: #6890
Hypothesis: by auditing various low-level functions and adding a few micro-optimizations, the HTML Tag Processor will scan faster.
Testing results: This change seems to have a marked improvement in scanning times, but since there are several changes incorporated into the patch it's unclear if any specific change was dominant. Of interest are a few places in the hot patch where a branch was removed.
Improvement in the token-scanning measures between 3.5% and 7.5% on average, a small tail of documents are slower, and a long tail are much faster, even above 15% faster. It's unclear what exactly directs the performance behaviors, but it's complicated and document-dependent.
Conclusion:
- Merge this patch.
- Continue trying to build a model for what directs the performance behaviors.
Replace the attribute associative array with a simple list. #5774
Hypothesis: by replacing the associative array and hash lookup of parsed attributes with a numeric array the Tag Processor will perform less work and go faster. This should be accomplished by skipping that hash lookup for the associative array of known attributes and by skipping the comparable lookup for detecting duplicates. In HTML, the first attribute is the real one, so it's okay to track a location for every duplicate attribute and simply find the first parsed one as the real attribute.
Testing results: TBD.
Conclusion:
- Need to update the PR to rebase against the current
trunk
. - Need to run benchmarks against one of the large datasets, like the
top100
list.
Change History (17)
This ticket was mentioned in PR #6890 on WordPress/wordpress-develop by @dmsnell.
5 months ago
#1
- Keywords has-patch added
This ticket was mentioned in PR #5774 on WordPress/wordpress-develop by @dmsnell.
5 months ago
#2
- Keywords has-unit-tests added
Trac Ticket: Core-61545
Experimenting to determine if we have any measurable performance overhead in the associate array that previously stored parsed attribute tokens.
From:
$this->attributes[ $attribute_name ] = $token;
To:
$this->attributes[] = $token;
Previously we also tracked duplicate attributes so that we could remove them on remove_attribute()
but in this case, we don't perform any determination for duplicate attributes until removing an attribute. This is because the first attribute (the one with the value) will appear first in the list of parsed attributes, and we can ignore the rest.
## Performance testing
In every case this has been improving the speed of the Tag Processor
by between 3% and 5%. Unfortunately I'm unable to upload images right now due to my internet connection, but I've tried with the single-page.html
document that is the 12 MB spec document and the demo-post.html
from Gutenberg and the results are consistent.
All tests use marginally less memory as well, but the total memory usage is less than 2 KB so it's hardly a gain 🙃
#6
@
4 months ago
I found that r58613 caused a regression. Namely, it is only considering the initial letter of a tag name when doing a next_tag()
call with a tag_name
query. For example:
<?php $picture_markup = '<p><img width="1024" height="683" src="http://localhost:8889/wp-content/uploads/2024/08/leaves-232-1024x683.webp" class="wp-image-9" alt="Green Leaves" decoding="async" loading="lazy" srcset="http://localhost:8889/wp-content/uploads/2024/08/leaves-232-1024x683.webp 1024w, http://localhost:8889/wp-content/uploads/2024/08/leaves-232-300x200.webp 300w, http://localhost:8889/wp-content/uploads/2024/08/leaves-232-768x512.webp 768w, http://localhost:8889/wp-content/uploads/2024/08/leaves-232-jpg.webp 1080w" sizes="(max-width: 1024px) 100vw, 1024px" /></p>'; $picture_processor = new WP_HTML_Tag_Processor( $picture_markup ); var_dump( $picture_processor->next_tag( array( 'tag_name' => 'picture' ) ) ); var_dump( $picture_processor->get_tag() );
This unexpectedly outputs even though there is no PICTURE
tag in the input:
bool(true) string(1) "P"
If I remove the P
tag, then it outputs as expected:
bool(false) NULL
#7
@
4 months ago
Strange. Thanks @westonruter - I had seen this and thought we had it fixed. If it's the same thing I saw, then any substring for a tag name will match, since I overlooked a string length comparison check in matches()
.
I'll make sure this is corrected today.
This ticket was mentioned in PR #7189 on WordPress/wordpress-develop by @dmsnell.
4 months ago
#8
Trac ticket: Core-61545.
An optimization pass on the HTML API left a bug in the matches()
method, whereby it would falsely detect a tag name match if the
found tag were a lexical subset of the requested tag. This occurred
because of the use of substr_compare()
without checking that the
outer lengths matched.
This patch resolves the bug by adding the length check.
Follow-up to [58613] (#6890).
Props @dmsnell, @westonruter.
See Core-61545.
This ticket was mentioned in Slack in #core-performance by mukeshpanchal27. View the logs.
4 months ago
This ticket was mentioned in Slack in #core-performance by mukeshpanchal27. View the logs.
3 months ago
This ticket was mentioned in Slack in #core-performance by mukeshpanchal27. View the logs.
3 months ago
This ticket was mentioned in Slack in #core-performance by mukeshpanchal27. View the logs.
3 months ago
#17
@
2 months ago
- Resolution set to fixed
- Status changed from assigned to closed
With 6.7 Beta 1 releasing shortly, I'm closing this ticket as Fixed
. Any follow-up work can happen in new tickets. If there's additional iteration needed here specifically, this can be reopened as a Task (Blessed)
in 6.7
.
Trac ticket: Core-61545
## Status
## Summary
Measured change in parsing time to count tags in HTML5 spec:
single-page.html
## Benchmarking
Based on the
top100
set of URLs from https://github.com/ada-url/url-various-datasets I ran a script count the number of tags in each document, having previously downloaded all URLs.Of these, not all downloaded successfully and not all were HTML files.
82,406 HTML files were analyzed, representing pages from the top 100 most popular websites.
When counting all tags, *trunk* took between 310 seconds and 313 seconds across multiple test runs, measured from
microtime()
within the process parsing the HTML, and only measuring around thenext_token()
loop.On this branch, the counting took between 293 seconds and 300 seconds, representing around a 5% real-world improvement in token parsing speed.
For the top100 dataset, this histogram represents the relative parsing speed in MB/s for the branch against
trunk
.