Make WordPress Core

Opened 20 months ago

Closed 13 months ago

Last modified 12 months ago

#56299 closed enhancement (duplicate)

WP_HTML_Walker: Inject dynamic data to block HTML markup in PHP

Reported by: zieladam's profile zieladam Owned by:
Milestone: Priority: normal
Severity: normal Version:
Component: General Keywords: has-patch has-unit-tests
Focuses: Cc:

Description

This Trac ticket surfaces Gutenberg PR 42485 from GitHub – let's keep the discussion there.

Let's introduce a lean HTML processor:

<?php
$w = new WP_HTML_Walker('<div id="first"><img /></div>');
$w->next_tag('img');
$w->set_attribute('src', '/wp-content/logo.png');
echo $w;
// <div id="first"><img src="/wp-content/logo.png" /></div>

Why?

Because dynamic blocks often need to inject a CSS class name or set <img src /> in the rendered block HTML markup but lack the means to do so:

At the moment, every block comes up with a custom-tailored regular expression. It is limiting, error-prone, and demands code duplication. It also leads to PRs such as strip whitespaces in render_block_core_cover.

Gutenberg PR on GitHub

Change History (30)

This ticket was mentioned in Slack in #core-editor by gziolo. View the logs.


20 months ago

This ticket was mentioned in PR #3920 on WordPress/wordpress-develop by @dmsnell.


13 months ago
#2

  • Keywords has-patch has-unit-tests added

## Status

❓ Is it acceptable to use the u Unicode flag in a PCRE pattern in Core?

This is the last remaining pertinent question form the author from a design perspective; it should be answered before a merge.

## Description

This commit pulls in the HTML Tag Processor from the Gutenbeg repository (pulled from WordPress/Gutenberg@98ce5c53c9492764aeff88852a4073fb32e784ac
). The Tag Processor attempts to be an HTML5-spec-compliant parser that provides the ability in PHP to find specific HTML tags and then add, remove, or update attributes on that tag. It provides a safe and reliable way to modify the attribute on HTML tags.

{{{php
Add missing rel attribute to links.
$p = new WP_HTML_Tag_Processor( $block_content );
if ( $p->next_tag( 'A' ) && empty( $p->get_attribute( 'rel' ) ) ) {

$p->set_attribute( 'noopener nofollow' );

}
return $p->get_updated_html();
}}}

Introduced originally in WordPress/Gutenberg#42485 and in https://core.trac.wordpress.org/ticket/56299 and developed within the Gutenberg repository, this HTML parsing system was built in order to address a persistent need (properly modifying HTML tag attributes) and was motivated after a sequence of block editor defects which stemmed from mismatches between actual HTML code and expectations for HTML input running through existing naive string-search-based solutions.

The Tag Processor is intended to operate fast enough to avoid being an obstacle on page render while using as little memory overhead as possible. It is practically a zero-memory-overhead system, and only allocates memory as changes to the input HTML document are enqueued, releasing that memory when flushing those changes to the document, moving on to find the next tag, or flushing its entire output via get_updated_html().

Rigor has been taken to ensure that the Tag Processor will not be consfused by unexpected or non-normative HTML input, including issues arising from quoting, from different syntax rules within <title>, <textarea>, and <script> tags, from the appearance of rare but legitimate comment and XML-like regions, and from a variety of syntax abnormalities such as unbalanced tags, incomplete syntax, and overlapping tags.

The Tag Processor is constrained to parsing an HTML document as a stream of tokens. It will not build an HTML tree or generate a DOM representation of a document. It is designed to start at the beginning of an HTML document and linearly scan through it, potentially modifying that document as it scans. It has no access to the markup inside or around tags and it has no ability to determine which tag openers and tag closers belong to each other, or determine the nesting depth of a given tag.

It includes a primitive bookmarking system to remember tags it has previously visited. These bookmarks refer to specific tags, not to string offsets, and continue to point to the same place in the document as edits are applied. By asking the Tag Processor to seek to a given bookmark it's possible to back up and continue processsing again content that has already been traversed.

Attribute values are sanitized with esc_attr() and rendered as double-quoted attributes. On read they are unescaped and unquoted. Authors wishing to rely on the Tag Processor therefore are free to pass around data as normal strings.

Convenience methods for adding and removing CSS class names exist in order to remove the need to process the class attribute.

{{{php
Update heading block class names
$p = new WP_HTML_Tag_Processor( $html );
while ( $p->next_tag() ) {

switch ( $p->get_tag() ) {

case 'H1':
case 'H2':
case 'H3':
case 'H4':
case 'H5':
case 'H6':

$p->remove_class( 'wp-heading' );
$p->add_class( 'wp-block-heading' );
break;

}
return $p->get_updated_html();
}}}

The Tag Processor is intended to be a reliable low-level library for traversing HTML documents and higher-level APIs are to be built upon it. Immediately, and in Core Gutenberg blocks it is meant to replace HTML modification that currently relies on RegExp patterns and simpler string replacements.

See the following for examples of such replacement:

Co-Authored-By: Adam Zielinski <adam@…>
Co-Authored-By: Bernie Reiter <ockham@…>
Co-Authored-By: Grzegorz Ziolkowski <grzegorz@…>

Trac ticket: https://core.trac.wordpress.org/ticket/57575

@dmsnell commented on PR #3920:


13 months ago
#3

If it pleases all involved I would prefer to rewrite this branch before merge so that we can ensure that the commit message and commits are proper, communicative, and relevant.

I cannot request changes on my own PR, but I'd appreciate it if someone would proxy-request-changes on my behalf so that we don't accidentally merge before that last communication step occurs. cc: @ockham

@azaozz commented on PR #3920:


13 months ago
#4

don't merge PRs but have to commit to svn

Right, the basic workflow is to take all the changes in the PR as a .diff from here: https://patch-diff.githubusercontent.com/raw/WordPress/wordpress-develop/pull/3920.diff then apply that diff to an SVN checkout of core trunk and commit it from there. (See the View Patch button under the description on https://core.trac.wordpress.org/ticket/57575).

@dmsnell commented on PR #3920:


13 months ago
#5

@costdev I've run through all the tests and removed or adjusted most of the @covers calls. In many cases there's no clear yes or no to whether the test covers a given set of methods. get_updated_html()'s behavior is covered by almost all test methods, but not directly as if testing it's behavior, because the interplay between all of the modification methods and get_updated_html() is deeply coupled.

To that end I've tried to reframe those in terms of what semantic operations are intended to be covered. I couldn't find a good way to be consistent the way I think you wanted it to be, so this is the best way to say, "for someone wanting to use set_attribute() to change a document, this test is covering that case, even though it's impossible to test this only by examining set_attribute()"

@anton-vlasenko I've added doing_it_wrong everywhere that we're reporting something as a guaranteed failure. thanks for that recommendation.

@dmsnell commented on PR #3920:


13 months ago
#6

Does anyone know the proper way to assert in a test that _doing_it_wrong was called?

@peterwilsoncc commented on PR #3920:


13 months ago
#7

Does anyone know the proper way to assert in a test that _doing_it_wrong was called?

You can use setExpectedIncorrectUsage per

https://github.com/WordPress/wordpress-develop/blob/ab7f91562deaaa5eac5fd64c11dd38fe687f48db/tests/phpunit/tests/db.php#L1521

#8 @gziolo
13 months ago

  • Milestone Awaiting Review deleted
  • Resolution set to duplicate
  • Status changed from new to closed

Duplicate of #57575.

@gziolo commented on PR #3920:


13 months ago
#9

don't merge PRs but have to commit to svn

Right, the basic workflow is to take all the changes in the PR as a .diff from here: https://patch-diff.githubusercontent.com/raw/WordPress/wordpress-develop/pull/3920.diff then apply that diff to an SVN checkout of core trunk and commit it from there. (See the View Patch button under the description on https://core.trac.wordpress.org/ticket/57575).

I love this command that takes care of everything:

{{{bash
npm run grunt patch:https://github.com/WordPress/wordpress-develop/pull/3920
}}}

@hellofromTonya commented on PR #3920:


13 months ago
#10

Hey all 👋

### Status Update:

  • Changes requested:

Several committers have reviewed and requested changes, none of which affect functionality, but are essential for commit consideration (as we committers require the same kinds of changes for all code being committed into Core).

  • @felixarntz, the 6.2 Performance Lead, is actively doing a performance assessment.

I asked him to do this assessment. Why? To understand with data if there could be impacts to users who have blocks/content that will now consume this API when they upgrade to 6.2. _With data_ this unknown concern can either be (a) eliminated or (b) known for further consideration.

Why? Native PHP RegEx and str functionality is faster than this API. So it'll be good to know if there are performance impacts for users upgrading to 6.2.

### Commit Proposal

GOAL: Commit today IF ALL are met:

  1. All of the requested changes from committers is committed into this PR.
  2. No significant performance regressions are found.

If/when both of the above are met, then commit.
Psst, I pinged @azaozz for commit consideration during his work day (as he works past my end of day).

Why today?
This API is blocking other feature/enhancement backports. Beta 1 is Tuesday. Let's do our best to resolve what can be resolved today. Then identify unknowns or concerns without data to be observed and monitored through the Beta cycle.

How does this plan sound? 👍 or 👎
@felixarntz @dmsnell @ockham @ntsekouras @azaozz @costdev @gziolo @anton-vlasenko

@dmsnell commented on PR #3920:


13 months ago
#11

Correct me if I am wrong but I think I noticed parsing of script tags accounted for the old school <script></script> but couldn't see any tests for that. Unless I've missed something, could that be tested too.

@peterwilsoncc you should find several tests for script tag processing, including the case you mentioned, in `test_next_tag_ignores_script_tag_contents`. The exact string you noted isn't in the tests, but it's covered by existing tests and handled properly.

@hellofromTonya commented on PR #3920:


13 months ago
#12

Status Update:

Following up on https://github.com/WordPress/wordpress-develop/pull/3920#issuecomment-1413760031.

  • All change requested are complete ✅

What's left for it to be committed?

  • Committer approval: currently there are no committers who have approved this PR yet
  • Completion of the performance assessment I asked @felixarntz to do ⚪

My weekend starts soon (well as soon as I clear my backport review and commit queue - not including this PR). I'll be back on Monday. The decision to commit this PR before Monday is in @azaozz and @felixarntz hands.

@azaozz @felixarntz IF there's consensus to commit it, then please commit it as soon as it's approved. Why? Other work is dependent upon this PR. So IF it's going into 6.2, sooner is better than later.

Thank you everyone for contributions, patience, and expertise!

cc @ntsekouras and @Mamaduka to keep you aware.

@hellofromTonya commented on PR #3920:


13 months ago
#13

For transparency, I was asked today:
Why not just commit it and then do the performance regression testing during Beta or RC?

Here's my response:

The reason why I asked to get it done before it gets committed is this:

If there's a performance regression that warrants this not going in, then all of the other work this is dependent upon it needs to be reworked and plans changed. That needs to known ASAP. Waiting to do it later means putting those features and the 6.2 release at risk.

@flixos90 commented on PR #3920:


13 months ago
#14

I ran a performance test for this via #3971, a PR purely intended for testing, which is directly based on this PR here, but also includes relevant usage from #3914 and https://github.com/dmsnell/wordpress-develop/pull/1 (see https://github.com/WordPress/wordpress-develop/commit/8d7ebfb21c643088139e178e2f67bedecc49791b).

TLDR: The performance impact appears to be negligible, which is a great outcome. There is no clear tendency of increase or decline in performance. Overall the data, if anything, indicates a this PR being a tiny bit faster than trunk today. The difference is too small to be able to surely say that, but at least it should be clear now that this is not a concern from a performance perspective. 👍

The test is based on the Gutenberg demo content, tested with the last three default themes, using the median from 20 requests in each scenario (#3971 vs trunk).

See this spreadsheet for the data and also notes on how exactly the test was conducted.

So I believe this should be unblocked now. Thanks everyone for the patience, I am really glad to see this positive outcome.

@hellofromTonya commented on PR #3920:


13 months ago
#15

What is the performance regression test I've asked for?

For blocks or features that are in Core now (<=6.1) but will change to use this API, what is the performance impact when users upgrade to 6.2?

Why?
This API will execute a lot of PHP code vs running RegEx and/or PHP native string functions. _It will have an impact_ How much of an impact is unknown. Why? This type of testing has not happened yet.

The data collected from a regression test is needed to know what the impact will be to users upgrading to 6.2. That information will then help with decision making about if that impact is acceptable or not.

There are many benefits to this API. For example, it will make maintaining string manipulation much easier and understandable. It will help resolve the challenges of working with HTML markup in PHP (whitespaces, malformed, etc.), though these challenges have always existed. That said, it also has be performant to not cause issues for users.

@dmsnell commented on PR #3920:


13 months ago
#16

Note to whoever commits this: I've posted a commit message in a details tag in the description at the top. A couple typos are fixed from the git commit message, and I've updated the props syntax for core instead of for GitHub

@zieladam commented on PR #3920:


13 months ago
#18

This is amazing, I really appreciate everyone's collaboration on this one ❤️

This ticket was mentioned in PR #4164 on WordPress/wordpress-develop by @zieladam.


12 months ago
#19

## Description

With this PR, Tag Processor can navigate to </script>, </textarea>, and </title> tag closers:

{{{php
$p = new WP_HTML_Tag_Processor('<script>ABC</script><p>');
$p->next_tag();
$p->next_tag( array( 'tag_closers' => 'visit' ) );
$p->get_tag(); 'script'
}}}

Without this commit, the Tag Processor skips over them:

{{{php
$p = new WP_HTML_Tag_Processor('<script>ABC</script><p>');
$p->next_tag();
$p->next_tag( array( 'tag_closers' => 'visit' ) );
$p->get_tag(); 'p'
}}}

Tag closers are supported by Tag Processor so it only makes sense to consistently support all of them.

cc @ockham @dmsnell @hellofromtonya @gziolo

Trac ticket: https://core.trac.wordpress.org/ticket/56299

@hellofromTonya commented on PR #4164:


12 months ago
#20

@adamziel https://core.trac.wordpress.org/ticket/56299 is closed. I opened a new Trac ticket for this enhancement.

Question: Has this already been released in Gutenberg? If no, then likely this needs to move to 6.3.

@zieladam commented on PR #4164:


12 months ago
#21

@hellofromtonya it haven't been released in Gutenberg yet, no. Let's punt to 6.3 then. I'll loop in @ntsekouras and @Mamaduka just in case

@hellofromTonya commented on PR #4164:


12 months ago
#22

Let's punt to 6.3 then. Or 6.2.1? I'll loop in @ntsekouras and @Mamaduka just in case

Enhancements need to go into a major. I'll move it to 6.3. Thank you @adamziel!

@hellofromTonya commented on PR #4164:


12 months ago
#23

Also wondering @adamziel @dmsnell, 6.3 is set to release in August 2023. Instead of continuing development in Core on the HTML API, might be better to continue its refinement and enhanced support / capability back to Gutenberg. Then backport them to Core when ready. That way, each can gain faster testing and feedback cycles. What do you think?

@zieladam commented on PR #4164:


12 months ago
#24

Enhancements need to go into a major.

@hellofromtonya oh this is a bugfix, not an enhancement. The Tag Processor should have never omitted these tag closers in the first place.

@hellofromTonya commented on PR #4164:


12 months ago
#25

oh this is a bugfix, not an enhancement. The Tag Processor should have never omitted these tag closers in the first place.

@adamziel thank you for clarifying!

Is this bug fix essential for 6.2? If yes, it could be committed into 6.2, but will trigger another beta release.

@zieladam commented on PR #4164:


12 months ago
#26

@hellofromtonya all good, 6.2.1 should be fine. I'll reply to your other comment later on

@zieladam commented on PR #4164:


12 months ago
#27

Instead of continuing development in Core on the HTML API, might be better to continue its refinement and enhanced support / capability back to Gutenberg. Then backport them to Core when ready. That way, each can gain faster testing and feedback cycles. What do you think?

There's a few reason these Tag Processor PRs are targeting core now:

  • It eliminates the risk of a large core PR too late in the release cycle – all the changes are already there
  • Backporting changes to Core isn't fast or simple, but backporting changes to Gutenberg is
  • Not updating Gutenberg files means no conflicts to resolve
  • The unit tests have been moved to core and it's a hassle to run or update them in the Gutenberg repo. I suppose Gutenberg could maintain a separate copy of the unit tests as well, but that means additional backporting and reconcilliation work

cc @dmsnell @ockham

@zieladam commented on PR #4164:


12 months ago
#28

@hellofromtonya @dmsnell I went for a separate test case without a data provider to test both cases (<script></script> and </script>). I couldn't find a clean way of fitting them into the existing test cases that were mentioned in a discussion.

@hellofromTonya commented on PR #4164:


12 months ago
#29

Confirmed:

  • Tests fail without the fixes ❌
  • Tests pass with the fixes ✅
  • Manually testing:
    • Able to reproduce the issue without the fix ❌
    • After applying the fix, works as expected ✅


Test Report https://core.trac.wordpress.org/ticket/57852#comment:16

Note: See TracTickets for help on using tickets.