Make WordPress Core

Opened 22 months ago

Closed 21 months ago

Last modified 3 days ago

#57575 closed enhancement (fixed)

Editor: Introduce HTML Tag Processor

Reported by: azaozz's profile azaozz Owned by: azaozz's profile azaozz
Milestone: 6.2 Priority: normal
Severity: normal Version:
Component: HTML API Keywords: has-patch has-unit-tests gutenberg-merge has-testing-info needs-dev-note add-to-field-guide
Focuses: performance Cc:

Description

This pulls in the HTML Tag Processor from the Gutenbeg repository (pulled from WordPress/gutenberg@98ce5c5). The Tag Processor attempts to be an HTML5-spec-compliant parser that provides the ability in PHP to find specific HTML tags and then add, remove, or update attributes on that tag. It provides a safe and reliable way to modify the attribute on HTML tags.

// Add missing `rel` attribute to links.
$p = new WP_HTML_Tag_Processor( $block_content );
if ( $p->next_tag( 'A' ) && empty( $p->get_attribute( 'rel' ) ) ) {
    $p->set_attribute( 'noopener nofollow' );
}
return $p->get_updated_html();

Introduced originally in WordPress/gutenberg#42485 and developed within the Gutenberg repository, this HTML parsing system was built in order to address a persistent need (properly modifying HTML tag attributes) and was motivated after a sequence of block editor defects which stemmed from mismatches between actual HTML code and expectectations for HTML input running through existing naive string-search-based solutions.

The Tag Processor is intended to operate fast enough to avoid being an obstacle on page render while using as little memory overhead as possible. It is practically a zero-memory-overhead system, and only allocates memory as changes to the input HTML document are enqueued, releasing that memory when flushing those changes to the document, moving on to find the next tag, or flushing its entire output via get_updated_html().

Rigor has been taken to ensure that the Tag Processor will not be consfused by unexpected or non-normative HTML input, including issues arising from quoting, from different syntax rules within <title>, <textarea>, and <script> tags, from the appearance of rare but legitimate comment and XML-like regions, and from a variety of syntax abnormalities such as unbalanced tags, incomplete syntax, and overlapping tags.

The Tag Processor is constrained to parsing an HTML document as a stream of tokens. It will not build an HTML tree or generate a DOM representation of a document. It is designed to start at the beginning of an HTML document and linearly scan through it, potentially modifying that document as it scans. It has no access to the markup inside or around tags and it has no ability to determine which tag openers and tag closers belong to each other, or determine the nesting depth of a given tag.

It includes a primitive bookmarking system to remember tags it has previously visited. These bookmarks refer to specific tags, not to string offsets, and continue to point to the same place in the document as edits are applied. By asking the Tag Processor to seek to a given bookmark it's possible to back up and continue processsing again content that has already been traversed.

Attribute values are sanitized with esc_attr() and rendered as double-quoted attributes. On read they are unescaped and unquoted. Authors wishing to rely on the Tag Processor therefore are free to pass around data as normal strings.

Convenience methods for adding and removing CSS class names exist in order to remove the need to process the class attribute.

// Update heading block class names
$p = new WP_HTML_Tag_Processor( $html );
while ( $p->next_tag() ) {
    switch ( $p->get_tag() ) {
	case 'H1':
	case 'H2':
	case 'H3':
	case 'H4':
	case 'H5':
	case 'H6':
	    $p->remove_class( 'wp-heading' );
	    $p->add_class( 'wp-block-heading' );
	    break;
}
return $p->get_updated_html();

The Tag Processor is intended to be a reliable low-level library for traversing HTML documents and higher-level APIs are to be built upon it. Immediately, and in Core Gutenberg blocks it is meant to replace HTML modification that currently relies on RegExp patterns and simpler string replacements.

See the following for examples of such replacement:

By @dmsnell, co-authored-by: @zieladam, @bernhard-reiter, @gziolo.

Change History (69)

This ticket was mentioned in PR #3920 on WordPress/wordpress-develop by @dmsnell.


22 months ago
#2

  • Keywords has-unit-tests added

@azaozz commented on PR #3920:


22 months ago
#3

Wondering if it would be better to move the if ( ! class_exists( ... ) ) to wp-html.php (where all files are required). There are quite a few places in core code where similar code is used. It will also stop WPCS from complaining about indentation.

@dmsnell commented on PR #3920:


22 months ago
#4

Wondering if it would be better to move the if ( ! class_exists( ... ) ) to wp-html.php

agreed. I've already done this in the Gutenberg side to get around the linter and will move the changes here once that merges, or beforehand maybe if I can.

@Bernhard Reiter commented on PR #3920:


21 months ago
#5

Wondering if it would be better to move the if ( ! class_exists( ... ) ) to wp-html.php

agreed. I've already done this in the Gutenberg side to get around the linter and will move the changes here once that merges, or beforehand maybe if I can.

I don't think we should even include those guards at all here -- after all, Core is the most "canonical" source for those classes, and its versions should "prevail" over the ones provided by the Gutenberg plugin for back-compat (which shouldn't happen normally, but in case the order of imports is mixed up for whatever reason).

#6 @hellofromTonya
21 months ago

  • Keywords gutenberg-merge added

Adding the experimental gutenberg-merge keyword for tracking of backport/sync from Gutenberg into Core.

@hellofromTonya commented on PR #3920:


21 months ago
#7

Wondering if it would be better to move the if ( ! class_exists( ... ) ) to wp-html.php (where all files are required). There are quite a few places in core where similar code is used. It will also stop WPCS from complaining about indentation.

Hmm, IMO these guards shouldn't be in Core. Why? Core should expect to load itself. The loading of these classes into memory happens early and _before_ a plugin or theme loads. Thus, class_exists() is unnecessary in Core as another instance of Core's classes should not already be in memory when Core loads.

That said, code extending Core does need these guards. For example, Gutenberg needs these guards to prevent loading its classes of the same name.

#8 follow-up: @dmsnell
21 months ago

Pardon me for my ignorance, but why would we want to remove the safety these checks provide? Based on the existing codebase I was under the impression this practice is somewhat normative and important, having also seen cases where people unintentionally pulled in class files with require that were already there, and thus created crashes where they weren't needed.

This is why I preferred wrapping each class in the check, but given that the linter wants to indent the entire file, I moved them to wp-html.php as a compromise.

If a version in Gutenberg were to somehow load first I wouldn't see that as problematic. We aren't wanting to break the API in the future.

@azaozz commented on PR #3920:


21 months ago
#9

I don't think we should even include those guards at all here -- after all, Core is the most "canonical"

Core does not have the responsibility of checking if its classes are already in memory.

It depends :)

What I'm getting at here is that both Gutenberg's PHP (in /lib) and core's PHP should be considered parts of the same codebase. I wouldn't be opposed to have class_exists() and function_exists() in core when they can safeguard against possible fatal errors, now or in the future.

If the Tag Processor's classes (files) are always loaded from wp-settings.php, which happens earlier than any plugin, I agree the class_exists() can only be in Gutenberg. For other cases (conditional loading, etc.) both places should have them.

@dmsnell commented on PR #3920:


21 months ago
#10

@costdev I've marked a number of our conversations as resolved to reduce the noise when reading this PR, but if you didn't find them resolved please un-resolve them and let me know. normally I wouldn't have done that but there were so many non-controversial changes you suggested and I figured it'd be easiest if I hid them from the chat once I accepted your changes.

#11 in reply to: ↑ 8 @hellofromTonya
21 months ago

Replying to dmsnell:

Pardon me for my ignorance, but why would we want to remove the safety these checks provide? Based on the existing codebase I was under the impression this practice is somewhat normative and important, having also seen cases where people unintentionally pulled in class files with require that were already there, and thus created crashes where they weren't needed.

This is why I preferred wrapping each class in the check, but given that the linter wants to indent the entire file, I moved them to wp-html.php as a compromise.

If a version in Gutenberg were to somehow load first I wouldn't see that as problematic. We aren't wanting to break the API in the future.

Good questions. Here are the reasonings:

  1. Core loads first before plugins and themes.
  2. Files are loaded very very early in the request cycle. These are grouped in wp-settings.php.
  3. WordPress Core's code is native to itself. The expectation is that Core does not need to guard its own functions, traits, classes, etc. from custom code that may also use the same custom naming conventions. Notice in Core, it does not wrap each class or function in a *_exists() before loading it into memory. That's on purpose. Why? Because the expectations are:
    • These functions, classes, etc. are native to WordPress Core.
    • Plugins and themes are responsible to guard any of their custom code that duplicates a native Core function, class, etc. to prevent a fatal error.

Since Core loads these files before plugins or themes, the *_exists() wrappers are unnecessary. Why?

  • Core loads first so there should not be a duplicate name already loaded into memory.
  • Core's code is native to Core. See the expectation explanation above in reason 3.

Take a look at the REST, Sitemaps, and other APIs within Core. Notice they are not wrapped in *_exists(). Take a look at all of the functions in Core. Notice they are not wrapped in function_exists().

Does that help @dmsnell?
Is there a compelling reason why the HTML API files are at risk whereas the other native Core functions, classes, etc. are not?

#12 @hellofromTonya
21 months ago

What features will use this API in 6.2?
What existing blocks already in Core (ie from before 6.2) will change to use the API?

I ask for these reasons:

  • Performance analysis.

I asked @flixos90 to review for performance. Knowing which features will use it provides needed information for testing before and after for existing blocks (i.e. blocks already in Core).

I'm curious and concerned about the potential introduction of performance regressions. Why? There's a lot of code to execute vs the PHP native regex and string functions. The way to remove those concerns to do a performance review to collect data.

  • Testing.

Test the usage to validate the API works as expected.

  • Assessing risks to users.

For existing blocks (in <= WP 6.1.x) that will change to use this API, want to test before (6.1.x version) as compared to 6.2 version. Why? Confidence check to identify any markup changes or potential breakages for users upgrading to 6.2.

@dmsnell commented on PR #3920:


21 months ago
#13

@hellofromtonya addressed most of your suggestions in 1fd0d7d511 but ended up reworking much of the phrasing to avoid the situation entirely and to avoid the passive voice (which I'm sure I did inconsistently). thanks for the thorough docs review; that has been one of my highest priorities in this project pre-merge.

@dmsnell commented on PR #3920:


21 months ago
#14

@hellofromtonya @anton-vlasenko I've preemptively marked many of the discussions you opened as resolved, even though I normally wouldn't do this. it's only because the comments in this PR have been getting hard to follow, and Github hiding many of them by default makes it even harder.

if any of those changes aren't to your satisfaction let me know and I'll unmark them.

@hellofromtonya is the @TODO on WP_DEBUG necessary? it seems redundant, particularly because WP_DEBUG _is_ the test for whether we're in development mode, at least for the purposes here. I can add them in, but it also seems like the kind of thing where if we're consistent then we'd need to annotate every use of WP_DEBUG, which would be nothing different than simply having WP_DEBUG - I'm probably missing something. to that end I haven't added them in yet unless they are communicating something that I'm not grasping.

@hellofromTonya commented on PR #3920:


21 months ago
#15

@hellofromtonya is the @TODO on WP_DEBUG necessary? it seems redundant, particularly because WP_DEBUG is the test for whether we're in development mode, at least for the purposes here. I can add them in, but it also seems like the kind of thing where if we're consistent then we'd need to annotate every use of WP_DEBUG, which would be nothing different than simply having WP_DEBUG - I'm probably missing something. to that end I haven't added them in yet unless they are communicating something that I'm not grasping.

Yes the @todo is necessary. Why? It's part of an agreement for using WP_DEBUG within the source code as of WP 6.2. I shared links to where that discussion and agreement happened amongst Committers including @azaozz @felixarntz @spacedmonkey and me.

@dmsnell commented on PR #3920:


21 months ago
#16

Yes the @todo is necessary. Why? It's part of an agreement for using WP_DEBUG within the source code as of WP 6.2. I shared links to where that discussion and agreement happened amongst Committers including @azaozz @felixarntz @spacedmonkey and me.

thanks for clarifying. I don't know if it's another case of Github hiding the discussion but I had reviewed those links and found where you had added the TODOs in other places, but missed the discussion and any decisions relating to it.

I went back and triple-checked to make sure I opened up every hidden comment in Github. I'll add the TODOs, but if there was something in any of those links you meant for me to see I don't know what it is, other than the fact that in some PRs you've added the TODOs, which doesn't particularly come across as stating much.

@flixos90 commented on PR #3920:


21 months ago
#17

@hellofromtonya @dmsnell FWIW I don't think the @todo comments around WP_DEBUG are needed here. It's not that _any_ use-case of WP_DEBUG now needs to be annotated as such.

In this PR, WP_DEBUG is used purely for debugging output, which is its main purpose. The reason in some other PRs we are requiring this annotation is that in those PRs we are misusing WP_DEBUG (for the lack of a better alternative, see https://core.trac.wordpress.org/ticket/57573 and https://core.trac.wordpress.org/ticket/57487) to decide whether or not to cache theme.json related data - something that WP_DEBUG was not meant to control.

So TLDR I don't think the @todo comment is relevant here as it's an actual debugging use-case here.

@dmsnell commented on PR #3920:


21 months ago
#18

It is universally discouraged to introduce a new API in WordPress core without using it.

thanks @felixarntz - I definitely appreciate the question about introducing APIs without use. for context, this is only coming in here at the last minute because somewhere along the way some of us, myself included, didn't realize that code using it was already coming over into 6.2.

right now it's being used with block supports and in several of the blocks. it's my (poor) understanding that that code will also be merging, but given the path of development I think it's best separated into different PRs. for one, some of those features didn't have any prior use, which means there's no diff to apply to move from no tag processor to the tag processor (there are a also a few conversions, linked towards the bottom of the PR description above). and for two, I haven't been close to all of the work that has started relying on the tag processor so I'm one of the least able people to push that work through the process into core.

maybe someone is waiting on me to create PRs for those other things, but if so I haven't held that understanding (@ockham should I be creating PRs for the block supports and block updates that utilize this?)

so I hope that helps, though I'm not sure where that leaves us. the design here was 100% led by the use-cases in the Gutenberg plugin and there's no real prior art, other than the bag of ad-hoc RegExp patterns and string replaces which perform a similar role but aren't intending to put HTML semantics at the forefront. should this go as well as is has been in the plugin there will be more to come, and there's plenty in the works in the Gutenberg repo, but for now this is a new thing with new code in Gutenberg, for the most part.

to restate: I'm sure we can add more code to this PR, but it may be quite messy given that there's very limited actual code where the tag processor was introduced to existing code, or done as a separate step from building out new or updated functionality.

I don't think that's reasonably possible with a massive PR like this

yep I totally understand, though if there's a desire I'd be happy to share overviews of its operation. at a minimum I think a thorough read-through of the comments will hopefully do a good job explaining it.

on the bright side this has already developed in collaboration with at least six people and gone through substantial vetting in Gutenberg and at least in parts of WordPress.com.

if it weren't so necessary to have a complete system in one I would have tried to split it into smaller pieces.

There are several @todo comments in this code, which suggests to me that this is not quite ready for "prime time".

These should not signal that. There are definitely a few remaining TODOs I've been trying to work through before merge, but the items at the top of the file are pretty much design notes on how to enhance or expand the system; not indications that it's not ready.

There are three relevant todos:

  • are we good to use a Unicode flag on PCRE patterns in Core or will that be a problem for sites where the Intl extension is missing?
  • one item I found while auditing code today and hope to have it removed some time tomorrow; this being a special case I don't think we've tested or seen in practice.
  • a note on the design limitations that I left in because I think we may remove that section instead of expand the TODO.

The others can be converted from TODO items to descriptions of things someone might want to do 😉

I don't think the @todo comment is relevant here as it's an actual debugging use-case here.

thanks; this is how I figured it would be, but I'll wait until @hellofromtonya responds before taking them out again.

@ntsekouras commented on PR #3920:


21 months ago
#19

@felixarntz there are already places in GB that use these classes and I think it's better not to add too many things in this PR. We're waiting for this PR to land in order to do the first packages release in this PR, where you can see how it's used.

@hellofromTonya commented on PR #3920:


21 months ago
#20

So TLDR I don't think the @todo comment is relevant here as it's an actual debugging use-case here.

Thanks for clarifying @felixarntz! Hey @dmsnell, I'll update my suggestions to remove the @todo requirement. It seems it is not needed.

@flixos90 commented on PR #3920:


21 months ago
#22

Thanks @gziolo, this is super helpful.

@dmsnell commented on PR #3920:


21 months ago
#23

@felixarntz I forgot that I'm not a member of the WordPress-develop repo so I can't create a PR against this one and have it show up here, but I've created it in my fork at https://github.com/dmsnell/wordpress-develop/pull/1

I could recreate that so it's targeting trunk here and it will show up in the list of PRs in this repo, but I was thinking reducing the noise might help. Turns out even WordPress/Gutenberg#46625 is a bit tricky to port over because half of it applies to elements, which exists in WordPress already, while the other half applies to settings, which is new in 6.2.

I'm happy to rearrange the commits, branches, and PR to your desires, just let me know your preferences.

@flixos90 commented on PR #3920:


21 months ago
#24

Thanks @dmsnell, that works well enough. I'll have a look.

@dmsnell commented on PR #3920:


21 months ago
#25

Pushed 28e9bf3595 which included the next-to-last item I really wanted to hit pre-merge. It's mainly a combination of renames of internal components and expansions of comments which clarify what I think has been the most confusing part of this system for me - get_updated_html(). If it breaks any of the tests accidentally then I'll fix those.

The last item is something I'd like someone to answer who knows: can we PCRE patterns with the Unicode flag active in Core? If we can't, we might as well remove the attribute name check, or at least the parts of it which validates against characters above the US-ASCII range. That's not a very critical part of the system, and removing it would be easier at this point than introducing a UTF-8 decoder.

The only remaining thing I know of that I would eventually like to update in this class is adding proper handling of HTML character references. This _is not intended for 6.2_, but possibly 6.3. A working PR has been stewing in early form in WordPress/Gutenberg#47040 and is less of an API/interface change and more of a bug fix (sadly PHP does not and never has had a proper character reference decoder). In the meantime most people won't notice the non-spec-compliant behavior, and such broken behavior is universal where anyone relies on html_entity_decode(). This is neither unique to the Tag Processor nor is it introduced by it.

@dmsnell commented on PR #3920:


21 months ago
#26

If it pleases all involved I would prefer to rewrite this branch before merge so that we can ensure that the commit message and commits are proper, communicative, and relevant.

I cannot request changes on my own PR, but I'd appreciate it if someone would proxy-request-changes on my behalf so that we don't accidentally merge before that last communication step occurs. cc: @ockham

@azaozz commented on PR #3920:


21 months ago
#27

don't merge PRs but have to commit to svn

Right, the basic workflow is to take all the changes in the PR as a .diff from here: https://patch-diff.githubusercontent.com/raw/WordPress/wordpress-develop/pull/3920.diff then apply that diff to an SVN checkout of core trunk and commit it from there. (See the View Patch button under the description on https://core.trac.wordpress.org/ticket/57575).

@dmsnell commented on PR #3920:


21 months ago
#28

@costdev I've run through all the tests and removed or adjusted most of the @covers calls. In many cases there's no clear yes or no to whether the test covers a given set of methods. get_updated_html()'s behavior is covered by almost all test methods, but not directly as if testing it's behavior, because the interplay between all of the modification methods and get_updated_html() is deeply coupled.

To that end I've tried to reframe those in terms of what semantic operations are intended to be covered. I couldn't find a good way to be consistent the way I think you wanted it to be, so this is the best way to say, "for someone wanting to use set_attribute() to change a document, this test is covering that case, even though it's impossible to test this only by examining set_attribute()"

@anton-vlasenko I've added doing_it_wrong everywhere that we're reporting something as a guaranteed failure. thanks for that recommendation.

@dmsnell commented on PR #3920:


21 months ago
#29

Does anyone know the proper way to assert in a test that _doing_it_wrong was called?

@peterwilsoncc commented on PR #3920:


21 months ago
#30

Does anyone know the proper way to assert in a test that _doing_it_wrong was called?

You can use setExpectedIncorrectUsage per

https://github.com/WordPress/wordpress-develop/blob/ab7f91562deaaa5eac5fd64c11dd38fe687f48db/tests/phpunit/tests/db.php#L1521

#31 @gziolo
21 months ago

#56299 was marked as a duplicate.

@gziolo commented on PR #3920:


21 months ago
#32

don't merge PRs but have to commit to svn

Right, the basic workflow is to take all the changes in the PR as a .diff from here: https://patch-diff.githubusercontent.com/raw/WordPress/wordpress-develop/pull/3920.diff then apply that diff to an SVN checkout of core trunk and commit it from there. (See the View Patch button under the description on https://core.trac.wordpress.org/ticket/57575).

I love this command that takes care of everything:

{{{bash
npm run grunt patch:https://github.com/WordPress/wordpress-develop/pull/3920
}}}

@hellofromTonya commented on PR #3920:


21 months ago
#33

Hey all 👋

### Status Update:

  • Changes requested:

Several committers have reviewed and requested changes, none of which affect functionality, but are essential for commit consideration (as we committers require the same kinds of changes for all code being committed into Core).

  • @felixarntz, the 6.2 Performance Lead, is actively doing a performance assessment.

I asked him to do this assessment. Why? To understand with data if there could be impacts to users who have blocks/content that will now consume this API when they upgrade to 6.2. _With data_ this unknown concern can either be (a) eliminated or (b) known for further consideration.

Why? Native PHP RegEx and str functionality is faster than this API. So it'll be good to know if there are performance impacts for users upgrading to 6.2.

### Commit Proposal

GOAL: Commit today IF ALL are met:

  1. All of the requested changes from committers is committed into this PR.
  2. No significant performance regressions are found.

If/when both of the above are met, then commit.
Psst, I pinged @azaozz for commit consideration during his work day (as he works past my end of day).

Why today?
This API is blocking other feature/enhancement backports. Beta 1 is Tuesday. Let's do our best to resolve what can be resolved today. Then identify unknowns or concerns without data to be observed and monitored through the Beta cycle.

How does this plan sound? 👍 or 👎
@felixarntz @dmsnell @ockham @ntsekouras @azaozz @costdev @gziolo @anton-vlasenko

@dmsnell commented on PR #3920:


21 months ago
#34

Correct me if I am wrong but I think I noticed parsing of script tags accounted for the old school <script></script> but couldn't see any tests for that. Unless I've missed something, could that be tested too.

@peterwilsoncc you should find several tests for script tag processing, including the case you mentioned, in `test_next_tag_ignores_script_tag_contents`. The exact string you noted isn't in the tests, but it's covered by existing tests and handled properly.

#35 @flixos90
21 months ago

  • Focuses performance added

#36 @hellofromTonya
21 months ago

  • Keywords changes-requested added

Marking with changes-requested to reflect the current state of the PR 3920.

@hellofromTonya commented on PR #3920:


21 months ago
#37

Status Update:

Following up on https://github.com/WordPress/wordpress-develop/pull/3920#issuecomment-1413760031.

  • All change requested are complete ✅

What's left for it to be committed?

  • Committer approval: currently there are no committers who have approved this PR yet
  • Completion of the performance assessment I asked @felixarntz to do ⚪

My weekend starts soon (well as soon as I clear my backport review and commit queue - not including this PR). I'll be back on Monday. The decision to commit this PR before Monday is in @azaozz and @felixarntz hands.

@azaozz @felixarntz IF there's consensus to commit it, then please commit it as soon as it's approved. Why? Other work is dependent upon this PR. So IF it's going into 6.2, sooner is better than later.

Thank you everyone for contributions, patience, and expertise!

cc @ntsekouras and @Mamaduka to keep you aware.

@hellofromTonya commented on PR #3920:


21 months ago
#38

For transparency, I was asked today:
Why not just commit it and then do the performance regression testing during Beta or RC?

Here's my response:

The reason why I asked to get it done before it gets committed is this:

If there's a performance regression that warrants this not going in, then all of the other work this is dependent upon it needs to be reworked and plans changed. That needs to known ASAP. Waiting to do it later means putting those features and the 6.2 release at risk.

@flixos90 commented on PR #3920:


21 months ago
#39

I ran a performance test for this via #3971, a PR purely intended for testing, which is directly based on this PR here, but also includes relevant usage from #3914 and https://github.com/dmsnell/wordpress-develop/pull/1 (see https://github.com/WordPress/wordpress-develop/commit/8d7ebfb21c643088139e178e2f67bedecc49791b).

TLDR: The performance impact appears to be negligible, which is a great outcome. There is no clear tendency of increase or decline in performance. Overall the data, if anything, indicates a this PR being a tiny bit faster than trunk today. The difference is too small to be able to surely say that, but at least it should be clear now that this is not a concern from a performance perspective. 👍

The test is based on the Gutenberg demo content, tested with the last three default themes, using the median from 20 requests in each scenario (#3971 vs trunk).

See this spreadsheet for the data and also notes on how exactly the test was conducted.

So I believe this should be unblocked now. Thanks everyone for the patience, I am really glad to see this positive outcome.

@hellofromTonya commented on PR #3920:


21 months ago
#40

What is the performance regression test I've asked for?

For blocks or features that are in Core now (<=6.1) but will change to use this API, what is the performance impact when users upgrade to 6.2?

Why?
This API will execute a lot of PHP code vs running RegEx and/or PHP native string functions. _It will have an impact_ How much of an impact is unknown. Why? This type of testing has not happened yet.

The data collected from a regression test is needed to know what the impact will be to users upgrading to 6.2. That information will then help with decision making about if that impact is acceptable or not.

There are many benefits to this API. For example, it will make maintaining string manipulation much easier and understandable. It will help resolve the challenges of working with HTML markup in PHP (whitespaces, malformed, etc.), though these challenges have always existed. That said, it also has be performant to not cause issues for users.

@dmsnell commented on PR #3920:


21 months ago
#41

Note to whoever commits this: I've posted a commit message in a details tag in the description at the top. A couple typos are fixed from the git commit message, and I've updated the props syntax for core instead of for GitHub

#42 @hellofromTonya
21 months ago

  • Keywords changes-requested removed

Removing the changes-requested keyword as all changes have been resolved in the PR.

#43 @azaozz
21 months ago

  • Owner set to azaozz
  • Resolution set to fixed
  • Status changed from new to closed

In 55203:

Introduce HTML API with HTML Tag Processor

This commit pulls in the HTML Tag Processor from the Gutenbeg repository.

The Tag Processor attempts to be an HTML5-spec-compliant parser that provides the ability in PHP to find specific HTML tags and then add, remove, or update attributes on that tag. It provides a safe and reliable way to modify the attribute on HTML tags.

More information: https://github.com/WordPress/wordpress-develop/pull/3920.

Props: antonvlasenko, bernhard-reiter, costdev, dmsnell, felixarntz, gziolo, hellofromtonya, zieladam, flixos90, ntsekouras, peterwilsoncc, swissspidy, andrewserong, onemaggie, get_dave, aristath, scruffian, justlevine, andraganescu, noisysocks, dlh, soean, cbirdsong, revgeorge, azaozz.
Fixes #57575.

#45 @azaozz
21 months ago

In 55206:

Fix couple of typos in inline docs.

Props: ironprogrammer.
See #57575.

@zieladam commented on PR #3920:


21 months ago
#46

This is amazing, I really appreciate everyone's collaboration on this one ❤️

#47 @ironprogrammer
21 months ago

  • Keywords has-testing-info added

Though this has already been merged, I'm adding example test instructions here to help test contributors reviewing this during 6.2's prerelease cycle.

Please do not report bugs on this ticket, but submit as a new Trac ticket for appropriate tracking and triage.

Testing Instructions

Steps to Test

  1. Install the `WP_HTML_Tag_Processor` test plugin (installation info is in the plugin's header).
  2. Enable the plugin.
  3. Create a post, and insert a paragraph block. Add some text.
  4. In the paragraph text, add hyperlinks to internal and external URLs, then publish the post.
  5. View the post.
  6. Observe the rendered output, as well as the modified markup (via View Source).

Expected Results

  • ✅ External links in the paragraph are formatted bold with contrast-colored text, and prepended with the "external" dashicon.

Supplemental Artifacts

https://cldup.com/EJ0VQSmbwa.png
Figure 1: Example output from test plugin.

Additional Information

Testers wishing to "tinker" with the HTML Tag Processor API might also consider trying this "sandbox" plugin, which provides fast visual feedback of the new feature, and demonstrates some ways in which the API can be used.

This ticket was mentioned in PR #4112 on WordPress/wordpress-develop by @dmsnell.


21 months ago
#48

When the HTML API was introduced a number of fields were switched from private visibility to protected so that Gutenberg and other systems could more easily enhance the behaviors through subclassing. The $this->html property was overlooked but important for systems using the Tag Processor to stich HTML, specifically performing operations on innerHTML and innerText.

This patch updates that visibility to what it should have been during the introducing of the API in 39bfc2580d9b0ea7e6ff45b8d904408d925cbc4b.

Trac ticket: https://core.trac.wordpress.org/ticket/57575.

cc: @adamziel @ockham @azaozz

@gziolo commented on PR #4112:


21 months ago
#49

We can reuse https://core.trac.wordpress.org/ticket/57575. I'll take care of it.

#50 @gziolo
21 months ago

In 55402:

HTML API: Set $this->html to protected to support subclassing

When the HTML API was introduced a number of fields were switched from private visibility to protected so that Gutenberg and other systems could more easily enhance the behaviors through subclassing. The $this->html property was overlooked but important for systems using the Tag Processor to stich HTML, specifically performing operations on innerHTML and innerText.

Follow-up [55203].
Props dmsnell.
See #57575.

@zieladam commented on PR #4112:


21 months ago
#52

Just noting there's a few more properties that will potentially have to become public to support full HTML parsing – I'm running into access issues now in my other PR. None of that is important to include in 6.2, though, as we still don't know what will the new API look like.

This ticket was mentioned in PR #4115 on WordPress/wordpress-develop by @zieladam.


21 months ago
#53

## Description

With this PR seek() correctly finds bookmarks set on tag closers. It:

  • Uses the correct starting index for tag closer bookmarks
  • Adds array( 'tag_closers' => 'visit' ) in seek()

### About bookmark start indices:

Setting a bookmark on a tag should set its "start" position before the opening "<", e.g.:

<div> Testing a <b>Bookmark</b>
----------------^

The current calculation assumes this is always one byte to the left from $tag_name_starts_at.

However, in tag closers that index points to a solidus symbol "/":

<div> Testing a <b>Bookmark</b>
----------------------------^

The bookmark should therefore start two bytes before the tag name:

<div> Testing a <b>Bookmark</b>
---------------------------^

Trac ticket: https://core.trac.wordpress.org/ticket/57575

cc @ockham @dmsnell @gziolo

#54 @hellofromTonya
21 months ago

Opened #57787 for refining and fixing the 1 byte off issue in @zieladam's PR https://github.com/WordPress/wordpress-develop/pull/4115.

#55 @hellofromTonya
21 months ago

  • Keywords needs-dev-note add-to-field-guide added

@dmsnell commented on PR #4112:


21 months ago
#56

@adamziel $bookmarks is already protected in Core.
are you using a pre-merge version of the code?

@zieladam commented on PR #4112:


21 months ago
#58

Oh! Hm, perhaps I am. Thanks!

@azaozz commented on PR #4112:


20 months ago
#59

@azaozz I included you here because I'd like to try and get this into 6.2 still. Should I create a Trac ticket for it?

Uh, sorry for the late response, seeing this just now :(

Yes, as the HTML API is already in core, any bugfixes or "code cleanup" can be added during beta. Bugfixes can also be added during RC but it is usually reserved for "bigger" bugs and edge cases, for example if there's a fatal error in some circumstances, etc.

For small changes like this one there's usually no need for a new ticket. New trac tickets are good for larger bugfixes, and for bugs that are fixed during RC.

Documentation (docblock) and tests can be added at any time and usually it is a good idea to have new tickets for them especially if the changes are more substantial.

#60 @hellofromTonya
20 months ago

In 55469:

HTML API: Fix finding RCData and Script tag closers.

Fixes finding the following tag closers </script>, </textarea>, and </title> in WP_HTML_Tag_Processor.

Follow-up to [55407], [55203].

Props zieladam, dmsnell, hellofromTonya.
Fixes #57852.
See #57575.

#61 @hellofromTonya
20 months ago

In 55477:

HTML API: Document shorthand usage of the next_tag().

Documents the shorthand usage, i.e. $this->next_tag( 'img' ), of WP_HTML_Tag_Processor::next_tag().

Also includes table alignments and formatting adjustments in the class docs.

Follow-up to [55203], [55206].

Props zieladam, poena, dmsnell, costdev, hellofromTonya.
Fixes #57863.
See #57575.

#62 @SergeyBiryukov
20 months ago

  • Component changed from Editor to HTML API

This ticket was mentioned in PR #6713 on WordPress/wordpress-develop by @zieladam.


5 months ago
#63

Trac Ticket: Core-61365

## What

Proposes an XML Tag Processor and XML Processor to complement the HTML Tag processor and the HTML Processor.

The XML API implements a subset of the XML 1.0 specification and supports documents with the following characteristics:

  • XML 1.0
  • Well-formed
  • UTF-8 encoded
  • Not standalone (so can use external entities)
  • No DTD, DOCTYPE, ATTLIST, ENTITY, or conditional sections

The API and ideas closely follow the HTML API implementation. The parser is streaming in nature, has a minimal memory footprint, and leaves unmodified markup as it was originally found.

It also supports streaming large XML documents, see https://github.com/adamziel/wordpress-develop/pull/43 for proof of concept.

## Design decisions

### Ampersand handling in text and attribute values

XML attributes cannot contain the characters < or &.

Enforcing < is fast and easy. Enforcing & is slow and complex because ampersands are actually allowed when used as the start. This means we'd have to decode all entities as we scan through the document – it doesn't seem to be worth it.

Right now, the WP_XML_Tag_Processor will only bale out when attempting to explicitly access an attribute value or text containing an invalid entity reference.

### Accepting all byte sequences up to < as text data

XML spec defines text data as follows:

[2] | Char | ::= | #x9 \| #xA \| #xD \| [#x20-#xD7FF] \| [#xE000-#xFFFD] \| [#x10000-#x10FFFF] | /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

Currently WP_XML_Tag_Processor does not attempt to reject bytes outside of these ranges. It will treat everything between valid elements as text data. On the upside, we avoid using an expensive preg_match() call for processing text. On the downside, we won't bale out early when processing a malformed, truncated, or a non-UTF-8 document.

I think it's a good trade-off to take, but I'm also happy to discuss.

### Entity decoding

This API only decodes the following entities:

  • Five character references mandated by the XML specification, that is &amp;, &lt;, &gt;, &quot;, &apos;
  • Numeric character references, e.g. &#123; or &#x1A;

Other entities trigger a parse error. This API could use the full HTML decoder, but it would be wrong and slow here. If lack of support for other entities ever becomes a problem, html_entity_decode() would be a good replacement for the custom implementation. It's worth noting that in XML1 mode it doesn't decode many HTML entities:

> html_entity_decode( '&oacute;' );
string(2) "ó"
> html_entity_decode( '&oacute;', ENT_XML1, 'UTF-8' );
string(8) "&oacute;"

## Follow-up work

  • Usage examples
  • Document the WP_XML_Processor class as well as the WP_HTML_Processor class is documented today.
  • Add a ton more unit tests for other compliance and failure modes.

## Out of scope for the foreseeable future

cc @dmsnell @sirreal

@dmsnell commented on PR #6713:


5 months ago
#64

@adamziel I pushed some changes to the decoder to avoid mixing the HTML and XML parsing rules. Sadly, while I thought that PHP's html_entity_decode( $text, ENT_XML1 ) might be sufficient, it allows capital X in a hexadecimal numeric character reference, which is a divergence from the spec.

In mucking around I became aware of how much more the role of errors is going to have to play in an XML API. I don't have any idea what's best. Character encoding failures I would assume are going to be fairly benign as long as we treat those failures as plaintext instead of actually decoding them, but that's a point to ponder.

This is going to get interesting with documents mixing HTML and XML, such as WXR. We're going to need to ensure that the tag and text parsing rules are properly separated. I'm still not sure what that means for us when we find something like a WXR without proper escaping of the content inside.

@zieladam commented on PR #6713:


5 months ago
#65

You're way too fast for me to keep up on this, but it shouldn't be that hard, because we will know where the start and end of a tag is, or if it's an empty tag, we know the full token.

Streaming makes it a bit more difficult, e.g. we may not have the closer yet, or we may not have the opener anymore. Perhaps pausing on incomplete input before yielding the tag opened would be useful here.

@zieladam commented on PR #6713:


5 months ago
#66

@adamziel I pushed some changes to the decoder to avoid mixing the HTML and XML parsing rules. Sadly, while I thought that PHP's html_entity_decode( $text, ENT_XML1 ) might be sufficient, it allows capital X in a hexadecimal numeric character reference, which is a divergence from the spec.

Oh dang it! Too bad.

In mucking around I became aware of how much more the role of errors is going to have to play in an XML API. I don't have any idea what's best. Character encoding failures I would assume are going to be fairly benign as long as we treat those failures as plaintext instead of actually decoding them, but that's a point to ponder.

Yes, that struck me too. At minimum, we'll need to communicate on which byte offset the error has occurred. Ideally, we'd show the context of the document, highlight the relevant part, and give a highly informative error message.

This is going to get interesting with documents mixing HTML and XML, such as WXR. We're going to need to ensure that the tag and text parsing rules are properly separated. I'm still not sure what that means for us when we find something like a WXR without proper escaping of the content inside.

I'm not sure I follow. By escaping of the content do you mean, say, missing <<CDATA[ opener, or having an HTML CDATA-lookalike comment inside of an XML CDATA section? There isn't much we can do, other than marking specific tags as PCDATA and stripping the initial CDATA opener and final CDATA closer.

@dmsnell commented on PR #6713:


5 months ago
#67

Streaming makes it a bit more difficult, e.g. we may not have the closer yet, or we may not have the opener anymore. Perhaps pausing on incomplete input before yielding the tag opened would be useful here.

My plan with the HTML API is to allow a "soft limit" on memory use. If we need to we can add an additional "hard limit" where it will fail. Should content come in and we're still inside a token, we just keep ballooning past the soft limit until we run out of memory, hit the hard limit, or parse the token.

So in this way I don't see streaming as a problem. The goal is to enable low-latency and low-overhead processing, but if we have to balloon in order to handle more extreme documents we can _break the rules_ as long as it's not too wild.

I prototyped this with the 🔔 Notifications on WordPress.com though in a slightly different way. The Notifications Processor has a runtime duration limit and a length limit, counting code points in the text nodes of the processed document. If it hits the runtime duration limit, it stops processing formatting and instead rapidly joins the remaining text nodes as unformatted plaintext. If it hits the length limit it stops processing altogether.

I believe that this Processor framework opens up new avenues for constraints and graceful degradation beyond those limits.

Yes, that struck me too. At minimum, we'll need to communicate on which byte offset the error has occurred. Ideally, we'd show the context of the document, highlight the relevant part, and give a highly informative error message.

This could be an interesting operating mode: when bailing, produce an Elm/Rust-quality error message. The character reference errors make me think we also have some notion of recoverable and non-recoverable errors. The character reference error does not cause syntactic issues, so we can zoom past them if we want, collecting them along the way into a list of errors. Errors for things like invalid tag names, attribute names, etc… are similar.

Missing or unexpected tags though I could see as being more severe since they have ambiguous impact on the rest of the document.

I'm not sure I follow. By escaping of the content do you mean, say, missing <<CDATA[ opener, or having an HTML CDATA-lookalike comment inside of an XML CDATA section? There isn't much we can do, other than marking specific tags as PCDATA and stripping the initial CDATA opener and final CDATA closer.

Some WXRs I've seen will have something like <content:encoded><[CDATA[<p>This is a post</p>]]></content:encoded>. Others, instead of relying on CDATA directly encode the HTML (Blogger does this) and it looks like <content:encoded>&lt;p&gt;This is a post&lt;/p&gt;</content:encoded>. The former is more easily recognizable.

I _believe_ that I've seen WXRs that are broken by not encoding the HTML either way, and these are the ones that scare me: <content:encoded><p>This is a post</p></content:encoded>. Or maybe it has been (non-WXR) RSS feeds where I've seen this.

Because the embedded document isn't encoded there's no boundary to detect it. I think this implies that for WordPress, our XML parser could have reason to be itself a blend of XML and HTML. For example:

  • If a tag is known to be an HTML tag interpret it as part of the surrounding text content.
  • Like you brought up, list known WXR or XML tags and treat them differently.
  • Directly encode rules for HTML-containing tags that we see in practice. We could have a list, even a list of breadcrumbs where they may be found.

This exploration is quite helpful because I think it's starting to highlight how the shape of WordPress' XML parsing needs differ from those of HTML.

@dmsnell commented on PR #6713:


5 months ago
#68

💡This will generally not be in the most critical performance hot path. We can probably relax some of the excessive optimizations, like relying on more functions to separate concepts like parse_tag_name(), parse_attribute_name(), and the like. This modularity would probably aid in comprehension, particularly since XML's rules are more constrained than HTML's.

@zieladam commented on PR #6713:


3 days ago
#69

I ported this work to https://github.com/WordPress/wordpress-playground/ and experimentally merged the two XML processors into a single WP_XML_Processor class. Let's see how it goes.

Note: See TracTickets for help on using tickets.