Opened 7 weeks ago
Last modified 6 weeks ago
#62091 new feature request
XML API: Produce XML Serialization of HTML (XHTML)
Reported by: | dmsnell | Owned by: | |
---|---|---|---|
Milestone: | Future Release | Priority: | normal |
Severity: | normal | Version: | trunk |
Component: | HTML API | Keywords: | has-patch |
Focuses: | Cc: |
Description (last modified by )
Even though XML cannot represent all possible HTML documents, and even though it's dangerous to send XHTML content generally, there are extremely rare cases where it's useful to directly embed an HTML document into an existing XML document, if the given document can be expressed in XML.
What is required to transform when converting HTML to XML?
- HTML void elements like `<img>` should adopt the self-closing flag to become `<img />`.
- HTML text should be decoded and then only `<`, `>`, `&`, `"`, and `'` ought to be re-encoded.
- Namespace transitions should involve changes to the default namespace.
  - When entering a foreign element (`SVG` and `MATH`).
  - When returning to HTML from a foreign element.
  - When entering HTML integration points, such as `FOREIGNOBJECT` and `ANNOTATION-XML` with the proper attribute.
  - Containing element needs namespace prefix on tag name, e.g. `<svg:svg>` or `<svg:foreignObject>`, and then we can update the default namespace on that element. Because the default namespace doesn't apply to attributes, and because namespaced attributes are different from non-namespaced attributes, we must leave the attributes un-namespaced.
- Something has to be done about un-representable characters.
  - Invalid UTF-8 bytes.
  - Unicode non-characters and other disallowed characters.
- HTML documents which cannot be represented in XML should result in rejection - cannot serialize.
- The HTML doctype declaration probably needs to be removed.
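The text-handling items above (decode, re-encode only the five XML-special characters, reject what XML cannot carry) can be sketched in plain PHP. This is only an illustration under stated assumptions: `sketch_html_text_to_xml_text()` is a hypothetical name, not part of the HTML API or the PR; it uses PHP's `html_entity_decode()`, which does not decode character references exactly as the HTML specification requires; and it treats the XML 1.0 `Char` production as the definition of "representable", which may be looser than what the ticket ultimately wants for non-characters.

```php
<?php
/**
 * Hypothetical helper (not part of WordPress): converts HTML text content
 * into XML text content, or returns null when it cannot be represented.
 */
function sketch_html_text_to_xml_text( string $html_text ): ?string {
	// Reject invalid UTF-8: with the /u flag, preg_match() returns false
	// on a malformed subject instead of 1.
	if ( 1 !== preg_match( '//u', $html_text ) ) {
		return null;
	}

	// Assumption: PHP's decoder stands in for a spec-compliant HTML decoder.
	$decoded = html_entity_decode( $html_text, ENT_QUOTES | ENT_HTML5, 'UTF-8' );

	// Reject anything outside the XML 1.0 "Char" production:
	// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
	$disallowed = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
	if ( 1 === preg_match( $disallowed, $decoded ) ) {
		return null;
	}

	// Re-encode only the five characters with special meaning in XML.
	// strtr() with an array never re-scans its own replacements.
	return strtr(
		$decoded,
		array(
			'&' => '&amp;',
			'<' => '&lt;',
			'>' => '&gt;',
			'"' => '&quot;',
			"'" => '&apos;',
		)
	);
}
```

Returning `null` rather than stripping or substituting characters mirrors the "cannot serialize" rule in the list: the caller has to decide what to do with a document that XML cannot hold.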
Design
With the introduction of `WP_HTML_Processor::serialize()` in #62036, an XML serialization might appear naturally as `WP_HTML_Processor::serialize_to_xml()`. When parsing as a fragment, the output may be an XML fragment, while a full parser would produce a valid XHTML document including the XML declaration.
Please share your thoughts if you know of other transformations that need to occur.
XML and HTML are divergent languages. You probably don't want XHTML. It's dangerous.
Change History (5)
This ticket was mentioned in PR #7408 on WordPress/wordpress-develop by @dmsnell.
7 weeks ago
#1
- Keywords has-patch added
@siliconforks commented on PR #7408:
7 weeks ago
#2
php > var_dump( ( WP_HTML_Processor::create_fragment( '<svg><foreignObject><p>Test<svg><text>Smile</text></p></foreignObject><p>test' ) )->serialize_to_xml() );
string(200) "<svg xmlns="http://www.w3.org/2000/svg"><foreignObject xmlns="http://www.w3.org/1999/xhtml"><p>Test<svg xmlns="http://www.w3.org/2000/svg"><text>Smile</text></svg></p></foreignObject></svg><p>test</p>"

php > var_dump( ( WP_HTML_Processor::create_full_parser( '<svg><foreignObject><p>Test<svg><text>Smile</text></p></foreignObject><p>test' ) )->serialize_to_xml() );
string(315) "<?xml version="1.0" encoding="UTF-8" ?> <html xmlns="http://www.w3.org/1999/xhtml"><head></head><body><svg xmlns="http://www.w3.org/2000/svg"><foreignObject xmlns="http://www.w3.org/1999/xhtml"><p>Test<svg xmlns="http://www.w3.org/2000/svg"><text>Smile</text></svg></p></foreignObject></svg><p>test</p></body></html>"

Are the above examples actually right? Is the `xmlns="http://www.w3.org/1999/xhtml"` supposed to be on the `foreignObject` element like that?
Compare the above code to the example here:
https://developer.mozilla.org/en-US/docs/Web/SVG/Element/foreignObject
- Exporting HTML content into an Atom feed without escaping it. HTML may/ought to be escaped like `<content type="html">&lt;p&gt;yay&lt;/p&gt;</content>`, but if the document can be serialized into XML it may be embedded directly, as in `<content type="xhtml" xmlns="http://www.w3.org/1999/xhtml"><p>yay</p></content>`.
The above Atom example has basically the same issue - is the `xmlns="http://www.w3.org/1999/xhtml"` supposed to be on the `content` element?
Compare to the example here:
https://en.wikipedia.org/wiki/Atom_(web_standard)#Example_of_an_Atom_1.0_feed
7 weeks ago
#3
Thanks @siliconforks.
You're right, in that the new default namespace applies to the `foreignObject` itself, which isn't correct. This PR is a big WIP though - honestly I would be just as happy if it always raised an exception 🙃
But I'm still exploring and trying to understand what needs to occur and how it can be done in order to transform as safely as possible. I'll add `WIP` to the title.
@siliconforks commented on PR #7408:
6 weeks ago
#5
In your example, wouldn't you also need to bind the `svg:` prefix to a namespace?
Like this (adding whitespace to make it more readable):
<svg xmlns="http://www.w3.org/2000/svg">
  <svg:foreignObject xmlns="http://www.w3.org/1999/xhtml" xmlns:svg="http://www.w3.org/2000/svg">
    <p>Hi</p>
  </svg:foreignObject>
</svg>
...or this:
<svg xmlns="http://www.w3.org/2000/svg" xmlns:svg="http://www.w3.org/2000/svg">
  <svg:foreignObject xmlns="http://www.w3.org/1999/xhtml">
    <p>Hi</p>
  </svg:foreignObject>
</svg>
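To make the prefix-binding question concrete, here is a minimal sketch of a serializer that emits the first form above: the integration-point element keeps its own namespace through a prefix, binds that prefix on the same element, and switches the default namespace for its content. Everything in it is hypothetical (the `sketch_serialize()` function, the `SKETCH_*` constants, and the array-based node shape with `ns`, `content_ns`, and `children` keys); it does not reflect how PR #7408 is implemented, and it ignores attributes entirely.

```php
<?php
const SKETCH_NS_HTML  = 'http://www.w3.org/1999/xhtml';
const SKETCH_NS_SVG   = 'http://www.w3.org/2000/svg';
// Hypothetical prefix table for elements that must keep their own namespace.
const SKETCH_PREFIXES = array( SKETCH_NS_SVG => 'svg' );

/**
 * Hypothetical sketch: serializes a nested-array node, declaring namespaces
 * only where the default namespace changes.
 */
function sketch_serialize( array $node, string $default_ns = '' ): string {
	$ns         = $node['ns'];
	$content_ns = $node['content_ns'] ?? $ns; // Namespace of the children.
	$name       = $node['name'];
	$attrs      = '';

	if ( $content_ns !== $ns ) {
		// Integration point: prefix the element, bind the prefix here,
		// and change the default namespace for the content.
		$prefix = SKETCH_PREFIXES[ $ns ];
		$name   = "{$prefix}:{$name}";
		$attrs  = " xmlns=\"{$content_ns}\" xmlns:{$prefix}=\"{$ns}\"";
	} elseif ( $ns !== $default_ns ) {
		// Plain namespace transition: the element adopts the new default.
		$attrs = " xmlns=\"{$ns}\"";
	}

	$inner = '';
	foreach ( $node['children'] ?? array() as $child ) {
		$inner .= is_array( $child )
			? sketch_serialize( $child, $content_ns )
			: htmlspecialchars( $child, ENT_XML1 | ENT_QUOTES, 'UTF-8' );
	}

	return "<{$name}{$attrs}>{$inner}</{$name}>";
}

$sketch_tree = array(
	'name'     => 'svg',
	'ns'       => SKETCH_NS_SVG,
	'children' => array(
		array(
			'name'       => 'foreignObject',
			'ns'         => SKETCH_NS_SVG,
			'content_ns' => SKETCH_NS_HTML,
			'children'   => array(
				array(
					'name'     => 'p',
					'ns'       => SKETCH_NS_HTML,
					'children' => array( 'Hi' ),
				),
			),
		),
	),
);

echo sketch_serialize( $sketch_tree ), "\n";
```

Binding the prefix on the prefixed element itself keeps each declaration local to where it is needed; hoisting `xmlns:svg` to the root, as in the second form, would require knowing in advance that an integration point appears somewhere below.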
Trac ticket: Core-62091
Built from WordPress/wordpress-develop#7331
Provides a mechanism to serialize an HTML fragment to the XML syntax. YOU PROBABLY SHOULDN'T USE THIS!!!!
REMEMBER that so-called "XHTML" served _without_ a path ending in `.xml` or without the `Content-type: application/xhtml+xml` HTTP header _will render as HTML_ and ONE SHOULD NOT SERVE XML/XHTML as HTML!!!

php > var_dump( ( WP_HTML_Processor::create_fragment( '<p>an <img> is worth Æ thousand words' ) )->serialize_to_xml() );
string(43) "<p>an <img /> is worth Æ thousand words</p>"
php > var_dump( ( WP_HTML_Processor::create_fragment( '<svg><foreignObject><p>Test<svg><text>Smile</text></p></foreignObject><p>test' ) )->serialize_to_xml() );
string(200) "<svg xmlns="http://www.w3.org/2000/svg"><foreignObject xmlns="http://www.w3.org/1999/xhtml"><p>Test<svg xmlns="http://www.w3.org/2000/svg"><text>Smile</text></svg></p></foreignObject></svg><p>test</p>"
Extremely rare cases when it's appropriate to use this
- Exporting HTML content into an Atom feed without escaping it. HTML may/ought to be escaped like `<content type="html">&lt;p&gt;yay&lt;/p&gt;</content>`, but if the document can be serialized into XML it may be embedded directly, as in `<content type="xhtml" xmlns="http://www.w3.org/1999/xhtml"><p>yay</p></content>`.

HTML generally cannot be expressed in XML, and according to the HTML specification, _Using the XML syntax is not recommended_! Prefer escaping the HTML to avoid corruption and data loss.
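The two Atom options can be sketched side by side. Both function names are hypothetical and neither comes from WordPress or the PR. The escaped variant is the recommended default; the XHTML variant assumes its input is already a well-formed XML fragment and wraps it in a `div` carrying the XHTML namespace, since the Atom format (RFC 4287) expects `type="xhtml"` content to be a single XHTML `div` element.

```php
<?php
/**
 * Hypothetical sketch: embeds HTML in an Atom entry by escaping it.
 * Any HTML string can be carried this way.
 */
function sketch_atom_content_escaped( string $html ): string {
	$escaped = htmlspecialchars( $html, ENT_XML1 | ENT_QUOTES, 'UTF-8' );

	return "<content type=\"html\">{$escaped}</content>";
}

/**
 * Hypothetical sketch: embeds an already-serialized XML fragment directly.
 * Assumes $xml_fragment is well-formed; no validation happens here.
 */
function sketch_atom_content_xhtml( string $xml_fragment ): string {
	return '<content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml">'
		. $xml_fragment
		. '</div></content>';
}

echo sketch_atom_content_escaped( '<p>yay</p>' ), "\n";
echo sketch_atom_content_xhtml( '<p>yay</p>' ), "\n";
```

The escaped form never fails, which is why it is the safer choice when the source HTML is arbitrary.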