Make WordPress Core

Opened 7 weeks ago

Last modified 6 weeks ago

#61365 new enhancement

Editor: Introduce XML Tag Processor and XML Processor

Reported by: zieladam Owned by:
Milestone: Awaiting Review Priority: normal
Severity: normal Version:
Component: HTML API Keywords: has-patch has-unit-tests
Focuses: Cc:

Description

This ticket proposes an XML Tag Processor and an XML Processor to complement the HTML Tag Processor and the HTML Processor.

The explorations have started in https://github.com/WordPress/wordpress-develop/pull/6713.

The XML API implements a subset of the XML 1.0 specification and supports documents with the following characteristics:

  • XML 1.0
  • Well-formed
  • UTF-8 encoded
  • Not standalone (so can use external entities)
  • No DTD, DOCTYPE, ATTLIST, ENTITY, or conditional sections

The API and ideas closely follow the HTML API implementation. The parser is streaming in nature, has a minimal memory footprint, and leaves unmodified markup as it was originally found.

This description is mostly a placeholder for now to reference in GitHub and source code. It will be fleshed out more in the future.

Change History (15)

This ticket was mentioned in PR #6713 on WordPress/wordpress-develop by @zieladam.


7 weeks ago
#1

Trac Ticket: https://core.trac.wordpress.org/ticket/61365

## What

Proposes an XML Tag Processor and an XML Processor to complement the HTML Tag Processor and the HTML Processor.

The XML API implements a subset of the XML 1.0 specification and supports documents with the following characteristics:

  • XML 1.0
  • Well-formed
  • UTF-8 encoded
  • Not standalone (so can use external entities)
  • No DTD, DOCTYPE, ATTLIST, ENTITY, or conditional sections

The API and ideas closely follow the HTML API implementation. The parser is streaming in nature, has a minimal memory footprint, and leaves unmodified markup as it was originally found.

## Design decisions

### Ampersand handling in text and attribute values

XML attributes cannot contain the characters < or &.

Enforcing the < rule is fast and easy. Enforcing the & rule is slow and complex, because ampersands are actually allowed when they start a valid entity or character reference. This means we'd have to decode all entities as we scan through the document – it doesn't seem to be worth it.

Right now, the WP_XML_Tag_Processor will only bail out when attempting to explicitly access an attribute value or text containing an invalid entity reference.
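Because this subset forbids DTDs, the only valid named references are the five predefined XML entities, so a validity check is simple in principle; the cost is that every value must still be scanned character by character. A rough, hypothetical sketch of such a check (not code from the PR):

```php
<?php
// Hypothetical helper, not part of the proposed API: verifies that every "&"
// in a value starts a well-formed reference. With no DTD allowed, the only
// valid named entities are the five XML 1.0 predefines.
function contains_only_valid_references( $value ) {
	$reference = '/&(?:lt|gt|amp|apos|quot|#[0-9]+|#x[0-9a-fA-F]+);/';
	$stripped  = preg_replace( $reference, '', $value );
	return false === strpos( $stripped, '&' );
}

var_dump( contains_only_valid_references( 'Tom &amp; Jerry' ) ); // bool(true)
var_dump( contains_only_valid_references( 'Tom & Jerry' ) );     // bool(false)
```

Even this cheap version touches every byte of every attribute value, which illustrates why enforcing the rule eagerly doesn't seem worth it.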

## Remaining work

  • Document the WP_XML_Processor class as thoroughly as the WP_HTML_Processor class is documented today.

## Out of scope and future work

cc @dmsnell @sirreal

@zieladam commented on PR #6713:


7 weeks ago
#2

There's only one actual test failure on PHP 7:

14) Tests_XmlApi_WpXmlTagProcessor::test_token_bookmark_span with data set "DIV end tag with attributes" ('</wp:content wp:post-type="x"..."yes">', 1, '</wp:content wp:post-type="x"..."yes">')
Bookmarked wrong span of text for full matched token.
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-'</wp:content wp:post-type="x" disabled="yes">'
+''

The rest of it is the PHPUnit setup complaining about _doing_it_wrong messages in tests exercising the different failure modes. I tried resolving that with expectedIncorrectUsage and $this->setExpectedIncorrectUsage, but I can't figure it out.

@zieladam commented on PR #6713:


7 weeks ago
#3

I'm on the fence about get_inner_text(). It feels wrong, but it's quite useful for parsing WXR:

I ended up removing get_inner_text() – it made streaming difficult and extracting inner text is easy:

$wxr    = file_get_contents( __DIR__ . '/test.wxr' );
$tokens = stream_next_xml_token( chunk_text( $wxr ) );
foreach ( $tokens as $processor ) {
    if (
        '#cdata-section' === $processor->get_token_type() &&
        $processor->matches_breadcrumbs( array( 'content:encoded' ) )
    ) {
        echo "\n " . dump_token( $processor );
    }
}

@jonsurrell commented on PR #6713:


7 weeks ago
#4

Have you seen the Extensible Markup Language (XML) Conformance Test Suites? It may be a helpful resource to find a lot of test cases, although after a quick scan many cases seem to violate "No DTD, DOCTYPE, ATTLIST, ENTITY, or conditional sections".

@zieladam commented on PR #6713:


6 weeks ago
#5

Have you seen the Extensible Markup Language (XML) Conformance Test Suites? It may be a helpful resource to find a lot of test cases, although after a quick scan many cases seem to violate "No DTD, DOCTYPE, ATTLIST, ENTITY, or conditional sections".

I haven't seen that one, thank you for sharing! With everything that's going on in Playground I may not be able to iron out all the remaining nuances here – would you like to take over at one point once XML parsing becomes a priority?

@dmsnell commented on PR #6713:


6 weeks ago
#6

Remember too that text appears in XML outside of CDATA sections. Any text content between tags is still part of the inner text. CDATA is there to allow raw content without escaping it. For example, it's possible to embed HTML as-is inside CDATA whereas outside of it, all of the angle brackets and ampersands would need to be escaped.

In effect, CDATA could be viewed transparently in the decoder as part of any other text segment, where the only difference is that CDATA decoding is different than normal text decoding.

We probably need to examine the comment parser, because an XML comment cannot contain --. What @sirreal recommended about the test suite is a good idea - it would uncover all sorts of differences from HTML. Altogether, though, I don't think WordPress needs or wants a fully spec-compliant parser. We want one that addresses our needs, which include handling a common type of incorrectly generated XML. Maybe we could have a parser mode that rejects all errors, but leave it off by default.
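For reference, XML 1.0's comment production forbids -- anywhere in the comment body and forbids a body that ends in - (so --> is unambiguous). A hypothetical standalone check, separate from anything in the PR, might look like:

```php
<?php
// Hypothetical validity check reflecting XML 1.0's Comment production:
// the body between "<!--" and "-->" may not contain "--" and may not end in "-".
function is_valid_xml_comment( $comment ) {
	if ( 0 !== strncmp( $comment, '<!--', 4 ) || '-->' !== substr( $comment, -3 ) ) {
		return false;
	}
	$inner = substr( $comment, 4, strlen( $comment ) - 7 );
	return false === strpos( $inner, '--' ) && '-' !== substr( $inner, -1 );
}

var_dump( is_valid_xml_comment( '<!-- fine -->' ) );        // bool(true)
var_dump( is_valid_xml_comment( '<!-- not -- fine -->' ) ); // bool(false)
```

Note that the second case is perfectly legal in HTML's comment parsing, which is exactly the kind of divergence the conformance suite should surface.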

@zieladam commented on PR #6713:


6 weeks ago
#7

@dmsnell Text is supported, or at least should be – is there a specific scenario that this PR doesn't handle today?

@dmsnell commented on PR #6713:


6 weeks ago
#8

@adamziel it was specifically your get_inner_text() replacement above

@zieladam commented on PR #6713:


6 weeks ago
#9

Oh that replacement is just overly selective, it can target text nodes, too.

By the way, there's no easy way of setting text in a tag that contains no text nodes. I'm noodling on emitting an empty text node, but haven't yet found a good way of doing it. Another idea would be to couple the tag token with the text that follows it, but that sounds unintuitive for the consumer of this API.

@dmsnell commented on PR #6713:


6 weeks ago
#10

By the way, there's no easy way of setting text in a tag without any text nodes.

You're way too fast for me to keep up on this, but it shouldn't be that hard, because we will know where the start and end of a tag is, or if it's an empty tag, we know the full token.

We'll just need to create a function like set_inner_text() which matches the different cases and replaces the appropriate token or tokens to make it happen.
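Since set_inner_text() doesn't exist yet, here is a hedged sketch of the core replacement step it would perform, expressed with plain string offsets rather than the processor's internals (the function name and offset parameters are illustrative, not from the PR):

```php
<?php
// Illustrative only: replace everything between a tag opener's end offset and
// its closer's start offset with properly escaped text. The real
// set_inner_text() would compute these offsets from parsed tokens and would
// also need to handle the empty-tag case (<wp:title/>).
function set_inner_text_between( $xml, $opener_end, $closer_start, $new_text ) {
	$escaped = htmlspecialchars( $new_text, ENT_XML1 | ENT_QUOTES, 'UTF-8' );
	return substr( $xml, 0, $opener_end ) . $escaped . substr( $xml, $closer_start );
}

$xml = '<wp:title>old</wp:title>';
// 10 is the end of "<wp:title>", 13 is the start of "</wp:title>" here.
echo set_inner_text_between( $xml, 10, 13, 'new & improved' );
// <wp:title>new &amp; improved</wp:title>
```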

@dmsnell commented on PR #6713:


6 weeks ago
#11

@adamziel I pushed some changes to the decoder to avoid mixing the HTML and XML parsing rules. Sadly, while I thought that PHP's html_entity_decode( $text, ENT_XML1 ) might be sufficient, it allows capital X in a hexadecimal numeric character reference, which is a divergence from the spec.
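The divergence comes from XML 1.0's CharRef production, which only permits a lowercase x in the hexadecimal form: `'&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'` (the hex digits themselves may be either case). A minimal, hypothetical syntax check making that rule concrete (not the PR's actual decoder):

```php
<?php
// Hypothetical check, not the decoder from the PR: accepts only character
// references matching XML 1.0's CharRef production. Note the lowercase "x":
// "&#X41;" is valid in HTML but not in XML.
function is_valid_xml_char_ref( $ref ) {
	return 1 === preg_match( '/\A&#(?:[0-9]+|x[0-9a-fA-F]+);\z/', $ref );
}

var_dump( is_valid_xml_char_ref( '&#x41;' ) ); // bool(true)
var_dump( is_valid_xml_char_ref( '&#X41;' ) ); // bool(false): capital X
var_dump( is_valid_xml_char_ref( '&#65;' ) );  // bool(true)
```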

In mucking around I became aware of how much more the role of errors is going to have to play in an XML API. I don't have any idea what's best. Character encoding failures I would assume are going to be fairly benign as long as we treat those failures as plaintext instead of actually decoding them, but that's a point to ponder.

This is going to get interesting with documents mixing HTML and XML, such as WXR. We're going to need to ensure that the tag and text parsing rules are properly separated. I'm still not sure what that means for us when we find something like a WXR without proper escaping of the content inside.

@zieladam commented on PR #6713:


6 weeks ago
#12

You're way too fast for me to keep up on this, but it shouldn't be that hard, because we will know where the start and end of a tag is, or if it's an empty tag, we know the full token.

Streaming makes it a bit more difficult, e.g. we may not have the closer yet, or we may not have the opener anymore. Perhaps pausing on incomplete input before yielding the tag opener would be useful here.

@zieladam commented on PR #6713:


6 weeks ago
#13

@adamziel I pushed some changes to the decoder to avoid mixing the HTML and XML parsing rules. Sadly, while I thought that PHP's html_entity_decode( $text, ENT_XML1 ) might be sufficient, it allows capital X in a hexadecimal numeric character reference, which is a divergence from the spec.

Oh dang it! Too bad.

In mucking around I became aware of how much more the role of errors is going to have to play in an XML API. I don't have any idea what's best. Character encoding failures I would assume are going to be fairly benign as long as we treat those failures as plaintext instead of actually decoding them, but that's a point to ponder.

Yes, that struck me too. At minimum, we'll need to communicate on which byte offset the error has occurred. Ideally, we'd show the context of the document, highlight the relevant part, and give a highly informative error message.

This is going to get interesting with documents mixing HTML and XML, such as WXR. We're going to need to ensure that the tag and text parsing rules are properly separated. I'm still not sure what that means for us when we find something like a WXR without proper escaping of the content inside.

I'm not sure I follow. By escaping of the content do you mean, say, a missing <![CDATA[ opener, or an HTML CDATA-lookalike comment inside of an XML CDATA section? There isn't much we can do, other than marking specific tags as PCDATA and stripping the initial CDATA opener and final CDATA closer.

@dmsnell commented on PR #6713:


6 weeks ago
#14

Streaming makes it a bit more difficult, e.g. we may not have the closer yet, or we may not have the opener anymore. Perhaps pausing on incomplete input before yielding the tag opened would be useful here.

My plan with the HTML API is to allow a "soft limit" on memory use. If we need to we can add an additional "hard limit" where it will fail. Should content come in and we're still inside a token, we just keep ballooning past the soft limit until we run out of memory, hit the hard limit, or parse the token.

So in this way I don't see streaming as a problem. The goal is to enable low-latency and low-overhead processing, but if we have to balloon in order to handle more extreme documents we can _break the rules_ as long as it's not too wild.
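The soft/hard limit idea could be reduced to a single decision per incoming chunk. A hypothetical sketch of that decision (names and parameters are illustrative, not from the HTML API):

```php
<?php
// Hypothetical buffer-growth policy, not code from the HTML API: growth is
// always allowed under the soft limit; past it, we only keep "ballooning"
// while still inside an incomplete token, and never past the hard limit.
function may_grow_buffer( $current_len, $chunk_len, $inside_token, $soft_limit, $hard_limit ) {
	$next_len = $current_len + $chunk_len;
	if ( $next_len <= $soft_limit ) {
		return true;
	}
	return $inside_token && $next_len <= $hard_limit;
}

var_dump( may_grow_buffer( 190, 50, false, 200, 1000 ) ); // bool(false)
var_dump( may_grow_buffer( 190, 50, true, 200, 1000 ) );  // bool(true)
```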

I prototyped this with the 🔔 Notifications on WordPress.com though in a slightly different way. The Notifications Processor has a runtime duration limit and a length limit, counting code points in the text nodes of the processed document. If it hits the runtime duration limit, it stops processing formatting and instead rapidly joins the remaining text nodes as unformatted plaintext. If it hits the length limit it stops processing altogether.

I believe that this Processor framework opens up new avenues for constraints and graceful degradation beyond those limits.

Yes, that struck me too. At minimum, we'll need to communicate on which byte offset the error has occurred. Ideally, we'd show the context of the document, highlight the relevant part, and give a highly informative error message.

This could be an interesting operating mode: when bailing, produce an Elm/Rust-quality error message. The character reference errors make me think we also need some notion of recoverable and non-recoverable errors. A character reference error doesn't cause syntactic issues, so we can zoom past them if we want, collecting them along the way into a list of errors. Errors for things like invalid tag names, attribute names, etc. are similar.

Missing or unexpected tags though I could see as being more severe since they have ambiguous impact on the rest of the document.

I'm not sure I follow. By escaping of the content do you mean, say, a missing <![CDATA[ opener, or an HTML CDATA-lookalike comment inside of an XML CDATA section? There isn't much we can do, other than marking specific tags as PCDATA and stripping the initial CDATA opener and final CDATA closer.

Some WXRs I've seen will have something like <content:encoded><![CDATA[<p>This is a post</p>]]></content:encoded>. Others, instead of relying on CDATA, directly encode the HTML (Blogger does this), and it looks like <content:encoded>&lt;p&gt;This is a post&lt;/p&gt;</content:encoded>. The former is more easily recognizable.

I _believe_ that I've seen WXRs that are broken by not encoding the HTML either way, and these are the ones that scare me: <content:encoded><p>This is a post</p></content:encoded>. Or maybe it has been (non-WXR) RSS feeds where I've seen this.

Because the embedded document isn't encoded, there's no boundary to detect it. I think this implies that, for WordPress, our XML parser could have reason to be itself a blend of XML and HTML. For example:

  • If a tag is known to be an HTML tag interpret it as part of the surrounding text content.
  • Like you brought up, list known WXR or XML tags and treat them differently.
  • Directly encode rules for HTML-containing tags that we see in practice. We could have a list, even a list of breadcrumbs where they may be found.
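For the two well-formed shapes above, telling them apart and recovering the HTML is mechanical. A hypothetical helper (not proposed API, just an illustration of the distinction):

```php
<?php
// Hypothetical heuristic for the two well-formed WXR shapes: HTML wrapped in
// a CDATA section vs. HTML with entity-encoded angle brackets. The third,
// broken shape (raw unescaped HTML) has no boundary and can't be handled here.
function decode_content_encoded( $inner ) {
	if ( 0 === strncmp( $inner, '<![CDATA[', 9 ) && ']]>' === substr( $inner, -3 ) ) {
		return substr( $inner, 9, -3 ); // raw HTML carried verbatim
	}
	return html_entity_decode( $inner, ENT_QUOTES | ENT_XML1, 'UTF-8' );
}

echo decode_content_encoded( '<![CDATA[<p>This is a post</p>]]>' ), "\n";
echo decode_content_encoded( '&lt;p&gt;This is a post&lt;/p&gt;' ), "\n";
// Both print: <p>This is a post</p>
```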

This exploration is quite helpful because I think it's starting to highlight how the shape of WordPress' XML parsing needs differ from those of HTML.

@dmsnell commented on PR #6713:


6 weeks ago
#15

💡 This will generally not be in the most critical performance hot path. We can probably relax some of the excessive optimizations, for example by relying on more small functions to separate concepts like parse_tag_name(), parse_attribute_name(), and the like. This modularity would probably aid comprehension, particularly since XML's rules are more constrained than HTML's.
