Make WordPress Core

Opened 5 months ago

Closed 2 months ago

#61545 closed enhancement (fixed)

HTML API: Performance Improvements for 6.7

Reported by: dmsnell's profile dmsnell Owned by: dmsnell's profile dmsnell
Milestone: 6.7 Priority: normal
Severity: normal Version: 6.6
Component: HTML API Keywords: has-patch has-unit-tests
Focuses: performance Cc:

Description (last modified by dmsnell)

The HTML API is already efficient, but it can (possibly) be better.

There are multiple ways to potentially improve the performance of the Tag Processor, HTML Processor, and HTML Decoder. This ticket is a tracking ticket for those experiments and changes

Experiments

Optimize low-level details: #6890

Hypothesis: by auditing various low-level functions and adding a few micro-optimizations, the HTML Tag Processor will scan faster.

Testing results: This change seems to have a marked improvement in scanning times, but since there are several changes incorporated into the patch it's unclear if any specific change was dominant. Of interest are a few places in the hot patch where a branch was removed.

Improvement in the token-scanning measures between 3.5% and 7.5% on average, a small tail of documents are slower, and a long tail are much faster, even above 15% faster. It's unclear what exactly directs the performance behaviors, but it's complicated and document-dependent.

Conclusion:

  • Merge this patch.
  • Continue trying to build a model for what directs the performance behaviors.

Replace the attribute associative array with a simple list. #5774

Hypothesis: by replacing the associative array and hash lookup of parsed attributes with a numeric array the Tag Processor will perform less work and go faster. This should be accomplished by skipping that hash lookup for the associative array of known attributes and by skipping the comparable lookup for detecting duplicates. In HTML, the first attribute is the real one, so it's okay to track a location for every duplicate attribute and simply find the first parsed one as the real attribute.

Testing results: TBD.

Conclusion:

  • Need to update the PR to rebase against the current trunk.
  • Need to run benchmarks against one of the large datasets, like the top100 list.

Change History (17)

This ticket was mentioned in PR #6890 on WordPress/wordpress-develop by @dmsnell.


5 months ago
#1

  • Keywords has-patch added

Trac ticket: Core-61545

## Status

  • passes all tests, code audit looks right
  • performs more efficiently in almost all cases, but worse in only a few minor cases. the expected speedup is around 3.5% - 7.5% for the cost of parsing the tags in realistic HTML documents.

## Summary

Measured change in parsing time to count tags in HTML5 spec:

  • from 1050ms to 930ms
  • roughly 13% faster in the HTML spec single-page.html
  • much less speedup in small documents with a higher text-to-syntax ratio

## Benchmarking

Based on the top100 set of URLs from https://github.com/ada-url/url-various-datasets I ran a script count the number of tags in each document, having previously downloaded all URLs.

Of these, not all downloaded successfully and not all were HTML files.

82,406 HTML files were analyzed, representing pages from the top 100 most popular websites.

When counting all tags, *trunk* took between 310 seconds and 313 seconds across multiple test runs, measured from microtime() within the process parsing the HTML, and only measuring around the next_token() loop.

On this branch, the counting took between 293 seconds and 300 seconds, representing around a 5% real-world improvement in token parsing speed.

For the top100 dataset, this histogram represents the relative parsing speed in MB/s for the branch against trunk.

https://github.com/WordPress/wordpress-develop/assets/5431237/2f6dbaca-7e08-4ba8-8c28-42d2e57f8d4b

This ticket was mentioned in PR #5774 on WordPress/wordpress-develop by @dmsnell.


5 months ago
#2

  • Keywords has-unit-tests added

Trac Ticket: Core-61545

Experimenting to determine if we have any measurable performance overhead in the associate array that previously stored parsed attribute tokens.

From:

$this->attributes[ $attribute_name ] = $token;

To:

$this->attributes[] = $token;

Previously we also tracked duplicate attributes so that we could remove them on remove_attribute() but in this case, we don't perform any determination for duplicate attributes until removing an attribute. This is because the first attribute (the one with the value) will appear first in the list of parsed attributes, and we can ignore the rest.

## Performance testing

In every case this has been improving the speed of the Tag Processor
by between 3% and 5%. Unfortunately I'm unable to upload images right now due to my internet connection, but I've tried with the single-page.html document that is the 12 MB spec document and the demo-post.html from Gutenberg and the results are consistent.

All tests use marginally less memory as well, but the total memory usage is less than 2 KB so it's hardly a gain 🙃

#3 @dmsnell
5 months ago

  • Description modified (diff)

#4 @dmsnell
5 months ago

In 58613:

HTML API: Optimize low-level parsing details in Tag Processor.

Introduces a number of micro-level optimizations in the Tag Processor to
improve token-scanning performance. Should contain no functional changes.

Based on benchmarking against a list of the 100 most-visited websites,
these changes result in an average improvement in performance of the Tag
Processor for scanning tags from between 3.5% and 7.5%.

Developed in https://github.com/WordPress/wordpress-develop/pull/6890
Discussed in https://core.trac.wordpress.org/ticket/61545

Follow-up to [55203].

See #61545.

#6 @westonruter
4 months ago

I found that r58613 caused a regression. Namely, it is only considering the initial letter of a tag name when doing a next_tag() call with a tag_name query. For example:

<?php
$picture_markup = '<p><img width="1024" height="683" src="http://localhost:8889/wp-content/uploads/2024/08/leaves-232-1024x683.webp" class="wp-image-9" alt="Green Leaves" decoding="async" loading="lazy" srcset="http://localhost:8889/wp-content/uploads/2024/08/leaves-232-1024x683.webp 1024w, http://localhost:8889/wp-content/uploads/2024/08/leaves-232-300x200.webp 300w, http://localhost:8889/wp-content/uploads/2024/08/leaves-232-768x512.webp 768w, http://localhost:8889/wp-content/uploads/2024/08/leaves-232-jpg.webp 1080w" sizes="(max-width: 1024px) 100vw, 1024px" /></p>';
$picture_processor = new WP_HTML_Tag_Processor( $picture_markup );
var_dump( $picture_processor->next_tag( array( 'tag_name' => 'picture' ) ) );
var_dump( $picture_processor->get_tag() );

This unexpectedly outputs even though there is no PICTURE tag in the input:

bool(true)
string(1) "P"

If I remove the P tag, then it outputs as expected:

bool(false)
NULL

#7 @dmsnell
4 months ago

Strange. Thanks @westonruter - I had seen this and thought we had it fixed. If it's the same thing I saw, then any substring for a tag name will match, since I overlooked a string length comparison check in matches().

I'll make sure this is corrected today.

This ticket was mentioned in PR #7189 on WordPress/wordpress-develop by @dmsnell.


4 months ago
#8

Trac ticket: Core-61545.

An optimization pass on the HTML API left a bug in the matches()
method, whereby it would falsely detect a tag name match if the
found tag were a lexical subset of the requested tag. This occurred
because of the use of substr_compare() without checking that the
outer lengths matched.

This patch resolves the bug by adding the length check.

Follow-up to [58613] (#6890).
Props @dmsnell, @westonruter.
See Core-61545.

@dmsnell commented on PR #7189:


4 months ago
#9

I've confirmed that the tests fail without the fix but pass with it.

#10 @dmsnell
4 months ago

In 58893:

HTML API: Only stop on full matches for requested tag name.

An optimization pass on the HTML API left a bug in the matches()
method, whereby it would falsely detect a tag name match if the
found tag were a lexical subset of the requested tag. This occurred
because of the use of substr_compare() without checking that the
outer lengths matched.

This patch resolves the bug by adding the length check.

Developed in https://github.com/wordpress/wordpress-develop/pull/7189
Discussed in https://core.trac.wordpress.org/ticket/61545

Follow-up to [58613].
Props dmsnell, westonruter.
See #61545.

This ticket was mentioned in Slack in #core-performance by mukeshpanchal27. View the logs.


4 months ago

This ticket was mentioned in Slack in #core-performance by mukeshpanchal27. View the logs.


3 months ago

#14 @dmsnell
3 months ago

  • Owner set to dmsnell
  • Status changed from new to assigned

This ticket was mentioned in Slack in #core-performance by mukeshpanchal27. View the logs.


3 months ago

This ticket was mentioned in Slack in #core-performance by mukeshpanchal27. View the logs.


3 months ago

#17 @davidbaumwald
2 months ago

  • Resolution set to fixed
  • Status changed from assigned to closed

With 6.7 Beta 1 releasing shortly, I'm closing this ticket as Fixed. Any follow-up work can happen in new tickets. If there's additional iteration needed here specifically, this can be reopened as a Task (Blessed) in 6.7.

Note: See TracTickets for help on using tickets.