Opened 7 years ago
Last modified 2 months ago
#43258 accepted enhancement
Output buffer template rendering and add filter for post-processing (e.g. caching, optimization)
| Reported by: | | Owned by: | |
|---|---|---|---|
| Milestone: | 6.9 | Priority: | normal |
| Severity: | normal | Version: | |
| Component: | General | Keywords: | has-patch early |
| Focuses: | docs, performance | Cc: | |
Description
I see that more and more theme and plugin developers start to use output buffering functions for the whole site as they need to manipulate the site's content. For example:
- Cache the page
- Combine JS and CSS files
- Load JS and CSS files for widgets only when needed
- Place SEO related things
As it is not officially available in WordPress, developers need to find their own way to buffer the output. Probably the most common action is 'template_redirect', where they can place ob_start().
Then they have to close their output buffer; probably the best action for that is 'shutdown'.
It wouldn't be a problem if this method were used only once on a site. When multiple plugins or themes use this technique, each should close only its own output buffer. As output buffers are LIFO stacked, it is very important to close them in the reverse of the order in which they were opened.
For example:
Cache plugin:
<?php
add_action('template_redirect', function(){
	ob_start();
});
add_action('shutdown', function(){
	$html = ob_get_clean();
	// Let's cache the html and show it...
});
CSS minify plugin:
<?php
add_action('template_redirect', function(){
	ob_start();
});
add_action('shutdown', function(){
	$html = ob_get_clean();
	// Let's find CSS files, minify them and replace the originals
});
In this case the page will be cached and the CSS files will be minified afterwards which will slow down the site as they should be in reverse order. We can fix that with priority, but both 'template_redirect' and 'shutdown' should get the same priority to make sure we close the related output buffer.
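The stacking behaviour described above can be reproduced in plain PHP, outside WordPress. The following is only a minimal sketch: the function name and the two plugin roles are hypothetical, and the "minification" is a stand-in whitespace collapse. It shows that the buffer opened last has to be closed first for the cache to receive the minified markup.

```php
<?php
/**
 * Plain-PHP illustration of nested output buffers; the plugin roles are hypothetical.
 *
 * @return array Final HTML and the order in which the buffers were closed.
 */
function sketch_nested_buffers(): array {
	$closed = array();

	ob_start(); // Opened first, e.g. by the cache plugin.
	ob_start(); // Opened second, e.g. by the CSS minify plugin.

	echo "<style>\n  body { color: red; }\n</style>";

	// LIFO: the buffer opened last (minify) has to be closed first.
	$minified = preg_replace( '/\s+/', ' ', (string) ob_get_clean() );
	$closed[] = 'minify';
	echo $minified; // Hand the processed markup down to the outer buffer.

	// Only now does the cache plugin's buffer contain the minified markup.
	$cached   = (string) ob_get_clean();
	$closed[] = 'cache';

	return array( $cached, $closed );
}
```

If the two ob_get_clean() calls were swapped between the plugins, the cache step would receive the unminified markup, which is the conflict this ticket describes.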
Documentation
What I propose is to have an official documentation which suggests the right way to use output buffering. It would help prevent several conflicts between plugins and themes.
Future
It would be great to see in WordPress core an in-built output buffering system. Then the developers wouldn't need to start and close output buffers on their own. WordPress would do the output buffering and at the end it would allow the filtering of the content.
<?php echo apply_filters('wp_output', $output);
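To make the idea more concrete, here is a rough sketch of a single shared buffer whose contents are passed through callbacks ordered by priority. The class Sketch_Output_Pipeline and its methods are hypothetical and are not a WordPress API; in core the same role would be played by a filter such as the 'wp_output' example above.

```php
<?php
// Hypothetical stand-in for a core-provided output filter; not a WordPress API.
final class Sketch_Output_Pipeline {
	/** @var array Map of priority => list of callbacks. */
	private static $callbacks = array();

	public static function add( callable $callback, int $priority = 10 ): void {
		self::$callbacks[ $priority ][] = $callback;
	}

	public static function run( string $output ): string {
		ksort( self::$callbacks ); // Lower priority numbers run first.
		foreach ( self::$callbacks as $group ) {
			foreach ( $group as $callback ) {
				$output = (string) $callback( $output );
			}
		}
		return $output;
	}
}

// The cache step is registered first but runs last because of its priority.
Sketch_Output_Pipeline::add(
	static function ( string $html ): string {
		return $html . '<!-- cached -->';
	},
	100
);
Sketch_Output_Pipeline::add(
	static function ( string $html ): string {
		return (string) preg_replace( '/>\s+</', '><', $html );
	},
	10
);

// In a real request the buffer could be opened once with:
// ob_start( array( 'Sketch_Output_Pipeline', 'run' ) );
$sketch_result = Sketch_Output_Pipeline::run( "<p>Hello</p>\n<p>World</p>" );
```

With one buffer and explicit priorities, the order of processing no longer depends on which plugin happened to call ob_start() first.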
Attachments (1)
Change History (39)
#2
in reply to:
↑ 1
@
7 years ago
Replying to swissspidy:
That sounds a lot like treating the symptoms not the cause.
And what do you think, what is the cause?
#3
@
7 years ago
- Keywords 2nd-opinion added
What I propose is to have an official documentation which suggests the right way to use output buffering. It would help prevent several conflicts between plugins and themes.
If we were to pursue some kind of "official" mechanism for output buffering, probably the best place for that documentation to live would be in the Theme Developer Handbook here: https://developer.wordpress.org/themes/
#4
@
7 years ago
I started to investigate how different plugins and themes use output buffering to modify the output of the page. Here you can check the collection: https://github.com/nextend/wp-ob-plugins-themes/blob/master/README.md
It would be great to hear feedback from other developers to find the preferred usage of output buffering and then create official documentation on this topic.
#5
@
4 years ago
I propose the attached WP_Output_Buffer class, which would be an optional feature that developers could enable and use when needed. It simply starts an output buffer and runs the 'output_buffer' filter on the content of the buffer which holds the whole output of the site.
Also the class gives suggested priorities for different use cases so developers can hook to the right point.
<?php
if (class_exists('WP_Output_Buffer')) {
	WP_Output_Buffer::enable();
	add_filter('output_buffer', array( $this, 'prepareOutput' ), WP_Output_Buffer::DEFAULT_PRIORITIES['CONTENT']);
} else {
	/**
	 * The plugin and theme mechanism for old WordPress versions which do not support this feature.
	 */
}
Several huge plugins use global output buffers like:
- Wordfence Security @mmaunder
- Jetpack
- Really Simple SSL @rogierlankhorst
- SG Optimizer @hristo-sg
- LiteSpeed Cache @litespeedtech
- WP Fastest Cache @emrevona
- Autoptimize @optimizingmatters
- Smush @alexdunae
- W3 Total Cache @joemoto
- WP Rocket
- EWWW Image Optimizer @nosilver4u
- Smart Slider 3
and many more: wpdirectory.net => ob_start\( ?array and wpdirectory.net => ob_start\(('|")
#7
@
4 years ago
Absolutely!
However, if using an output buffer isn't the recommended method (which is what @swissspidy seems to be suggesting), I'd love to see some documentation on what the preferred way is to manipulate an HTML document in its entirety.
#8
in reply to:
↑ description
@
2 years ago
Replying to nextendweb:
Future
It would be great to see in WordPress core an in-built output buffering system. Then the developers wouldn't need to start and close output buffers on their own. WordPress would do the output buffering and at the end it would allow the filtering of the content.
<?php echo apply_filters('wp_output', $output);
Related: #58285
This ticket was mentioned in Slack in #core by sergey. View the logs.
2 years ago
#11
@
20 months ago
- Summary changed from Output buffering to Output buffer template rendering and add filter for post-processing (e.g. caching, optimization)
#12
@
20 months ago
- Focuses performance added
- Keywords 2nd-opinion removed
- Milestone changed from Awaiting Review to Future Release
In addition to standardizing output buffering for the sake of caching plugins and optimization plugins, core also would benefit from an output buffer to do its own post-processing optimizations for images. See #59331.
#13
@
16 months ago
One of the areas I want to explore with the HTML API is adding a new set of filters for final rendered content where we could scan the full HTML document on render and let plugins attach to different events on that scan. For example, one filter to give access to a tag and its attributes, another filter to process #text node content between tags.
I'm optimistic that we'll be able to have something performant enough that if we can eliminate just a few of Core's existing filtering pipelines and replace them with this new single-pass transform that we'll break even on speed or even become faster than how things are today.
There is a heap of code out there doing full parsing of the HTML available to the filter, which often runs slow or stresses the available memory. I'd like to better understand what kinds of needs are out there leading developers to enable output buffering.
#14
@
16 months ago
This is something that we would be interested in participating in, as we make use of this in our main plugins to manage optimizations in the front-end output.
We do encounter from time to time issues with output buffering, when other plugins don't use it correctly.
#15
follow-up:
↓ 16
@
13 months ago
Note: I've proposed this as part of the Gutenberg experiment for full-page client-side navigation: https://github.com/WordPress/gutenberg/pull/61212
#16
in reply to:
↑ 15
@
12 months ago
Replying to westonruter:
Note: I've proposed this as part of the Gutenberg experiment for full-page client-side navigation: https://github.com/WordPress/gutenberg/pull/61212
This is slated to be part of Gutenberg 18.5.
#17
@
3 months ago
- Milestone changed from Future Release to 6.8
- Owner set to westonruter
- Status changed from new to accepted
Beyond page caching plugins and optimization plugins (e.g. Optimization Detective) which rely on output buffering, there are two specific optimizations which core could apply if output buffering were available, especially for classic themes:
- The large block library stylesheet could be split up into the block-specific stylesheets enabled via the should_load_separate_core_block_assets filter. (cf. performance#1834 https://github.com/WordPress/performance/issues/1834)
- The importmap script could be moved from the footer to the head (see WP_Script_Modules::add_hooks()).
I'm going to milestone this for 6.8 since so much would be enabled by this.
This ticket was mentioned in Slack in #core-performance by westonruter. View the logs.
3 months ago
#19
@
3 months ago
I personally think this would be a great addition to WordPress Core. While Gutenberg's implementation is only an experiment and therefore not quite running at scale, output buffers are and have been heavily used by various popular products (e.g. full page caching plugins) for more than a decade.
That said, while it's a technically simple change to make and clearly has large benefits, it hasn't been in WordPress Core all these years although it could have - so the question is why. Are there any real concerns, or has just nobody been confident enough to push for adding it so far?
With those questions in mind, I think this should get signed off from at least a few seasoned committers. So it may be a bit too late at this point in the 6.8 cycle, with just 5 days left before Beta. We can still see if we can get such consensus quickly, but worth flagging the timeline.
#20
@
3 months ago
Reviewing the Gutenberg implementation in https://github.com/WordPress/gutenberg/pull/61212, I wonder whether we can do better than just filtering the entire HTML string.
Especially with the new performant HTML processor (see related comment 13), maybe we should mandate using it, i.e. filter only an instance of that class? I think that would actively discourage any of the bad patterns we have seen (and done) in the past, like using regex on HTML.
For use-cases that don't alter the HTML (such as caching plugins), we could still expose that string but in a read-only way, such as via a new action that is fired as part of the output buffering.
Long story short: We probably shouldn't go with the quick and simple approach of filtering the entire HTML string, but think about something that encourages best practices.
#21
follow-up:
↓ 22
@
3 months ago
@flixos90 I know that @dmsnell has had similar thoughts in the past. However, there are use cases beyond just processing HTML. For example, caching plugins don't need to do any processing at all. They just need to capture the output buffer to put in the persistent object cache (for example) and maybe append an HTML comment to say that it was cached.
Also, some applications on the output buffer would only need the lighter-weight HTML Tag Processor which doesn't have all of HTML's complicated parsing rules internalized, so such extensions shouldn't be required to use it. For example, Optimization Detective is mostly able to get by using the HTML Tag Processor by taking into account the most common HTML idiosyncrasies (e.g. being able to omit closing tags on P tags, although WP is pretty good about having tags balanced). But Optimization Detective would eventually be better off using the HTML Processor so scenarios like missing closing DIV tags could be better handled. (Although in the end, the only impact is the XPath is not accurately computed, but it would still be stable to identify that tag regardless.) Note that Optimization Detective uses a subclass of WP_HTML_Tag_Processor so it wouldn't be able to use a single instance supplied by core anyway.
Also, other use cases like I mentioned in the previous comment could be implemented without the use of the HTML API by instead injecting a placeholder into the HEAD and then replacing it in the output buffer.
So I think adding a filter for the output buffer is the right approach, leaving the use of filter callbacks to decide how to process the HTML string.
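The placeholder approach mentioned above could look roughly like the following sketch. The function name, the placeholder string, and the idea of passing the late-collected markup as a parameter are all assumptions made for illustration; no regex or HTML parsing is involved, only a plain string replacement.

```php
<?php
/**
 * Replaces the first occurrence of a placeholder comment with markup collected later.
 * Hypothetical helper; the default placeholder string is illustrative only.
 */
function sketch_replace_placeholder( string $buffer, string $late_markup, string $placeholder = '<!--sketch-late-head-->' ): string {
	$position = strpos( $buffer, $placeholder );
	if ( false === $position ) {
		return $buffer; // The placeholder was never printed; leave the output untouched.
	}
	return substr_replace( $buffer, $late_markup, $position, strlen( $placeholder ) );
}
```

The placeholder would be printed while the HEAD is rendered, and the replacement would run once the rest of the page (and therefore the full set of needed styles or script modules) is known.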
#22
in reply to:
↑ 21
;
follow-up:
↓ 23
@
3 months ago
Replying to westonruter:
For example, caching plugins don't need to do any processing at all. They just need to capture the output buffer to put in the persistent object cache (for example) and maybe append an HTML comment to say that it was cached.
That's what I covered with my note on having an action for the raw string, but not making it filterable, to discourage problematic patterns as mentioned.
Also, some applications on the output buffer would only need the lighter-weight HTML Tag Processor which doesn't have all of HTML's complicated parsing rules internalized, so such extensions shouldn't be required to use it.
Maybe I'm missing something. Can you clarify what do you mean by lighter-weight HTML Tag Processor? What class is that, compared to what other class?
Note that Optimization Detective uses a subclass of WP_HTML_Tag_Processor so it wouldn't be able to use a single instance supplied by core anyway.
Couldn't this be handled by e.g. a decorator pattern? Alternatively, you mentioned it should eventually use the Core class anyway.
#23
in reply to:
↑ 22
;
follow-up:
↓ 25
@
3 months ago
Replying to flixos90:
Replying to westonruter:
For example, caching plugins don't need to do any processing at all. They just need to capture the output buffer to put in the persistent object cache (for example) and maybe append an HTML comment to say that it was cached.
That's what I covered with my note on having an action for the raw string, but not making it filterable, to discourage problematic patterns as mentioned.
That could work, but some things commonly done by caching plugins wouldn't be supported, like adding an HTML comment at the end of the response.
Also, some applications on the output buffer would only need the lighter-weight HTML Tag Processor which doesn't have all of HTML's complicated parsing rules internalized, so such extensions shouldn't be required to use it.
Maybe I'm missing something. Can you clarify what do you mean by lighter-weight HTML Tag Processor? What class is that, compared to what other class?
The HTML API has two classes: WP_HTML_Tag_Processor and WP_HTML_Processor. The latter is a subclass of the former which adds awareness of all of HTML's complicated parsing rules. In many cases, the desired HTML processing can use WP_HTML_Tag_Processor, for example to iterate to a given IMG tag to apply mutations. But to have full awareness of the structure of the tags in an HTML document, the more robust WP_HTML_Processor should be used. It is a superset and has more capabilities, but it should only be used if it is needed since it is more expensive to use. See @dmsnell's short summary in Updates to the HTML API in 6.6:
"The Tag Processor was initially designed to jump from tag to tag, then it was refactored to allow scanning every kind of syntax token in an HTML document. Likewise, the HTML Processor was initially designed to jump from tag to tag, all the while also acknowledging the complex HTML parsing rules."
Note that Optimization Detective uses a subclass of WP_HTML_Tag_Processor so it wouldn't be able to use a single instance supplied by core anyway.
Couldn't this be handled by e.g. a decorator pattern? Alternatively, you mentioned it should eventually use the Core class anyway.
Optimization Detective could eventually use the HTML Processor instead which should indeed eliminate most of the need for subclassing, but there are a couple capabilities blocking this:
- Insert HTML at an arbitrary point (e.g. in the HEAD and at the end of BODY).
- Obtain the node sibling index for breadcrumbs (e.g. this DIV is the 4th element child).
OD's subclass also introduces helper methods like get_xpath() and set_meta_attribute(), and it overrides set_attribute()/remove_attribute() to add meta attributes that indicate how the attributes were mutated.
But also, other applications wouldn't need a tag processor at all, as I mentioned above with hoisting styles from the footer to wp_head (e.g. implemented by printing a placeholder comment that gets replaced in the output buffer). Going to the opposite extreme, other applications may want to load the entire HTML document into the DOM (e.g. the AMP plugin), especially as PHP 8.4's new Dom\HTMLDocument is fully HTML5 compliant, in order to do much more advanced mutations of the document.
This ticket was mentioned in PR #8412 on WordPress/wordpress-develop by @westonruter.
3 months ago
#24
- Keywords has-patch added
This PR introduces output buffering of the rendered template starting just before the template_redirect action. The output buffer callback then passes the buffered output into the wp_template_output_buffer filter for processing. This is reusing the same output buffering logic that was developed for Optimization Detective and Gutenberg's Full Page Client-Side Navigation Experiment.
Examples for how this can be used:
- Always Load Block Styles on Demand: In classic themes a lot more CSS is added to a page than is needed because the HEAD is rendered before the rest of the page, so it is not yet known what blocks will be used. This can be fixed with output buffering.
- Always Print Script Modules in Head: In classic themes script modules are forced to print in the footer since the HEAD is rendered before the rest of the page, so it is not yet known what script modules will be enqueued. This can be fixed with output buffering.
- Gutenberg's Full Page Client-Side Navigation Experiment: No longer would it need to start its own output buffer, but it could just reuse the wp_template_output_buffer filter.
- Optimization Detective: The plugin would also be able to eliminate its output buffering, in favor of just reusing the wp_template_output_buffer filter.
- Caching plugins would also not need to output buffer the response, but they could reuse the filter to capture the output for storing in a persistent object cache while also appending some status HTML comment.
- Other optimization plugins (e.g. WP Rocket, AMP, etc) would similarly not need to do their own output buffering.
Trac ticket: https://core.trac.wordpress.org/ticket/43258
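From a plugin's point of view, hooking into the proposed filter might look like the sketch below. The filter name wp_template_output_buffer comes from the PR description above; the callback name is hypothetical, and the assumption that the callback receives the buffered output as its first argument and returns the modified string is not confirmed by this ticket. The add_filter() call is guarded so that the snippet also runs outside WordPress.

```php
<?php
/**
 * Appends a status comment to complete HTML documents (hypothetical callback).
 */
function sketch_append_status_comment( string $buffer ): string {
	// Leave anything that does not look like a full HTML document untouched.
	if ( false === stripos( $buffer, '</html>' ) ) {
		return $buffer;
	}
	return $buffer . "\n<!-- processed by sketch plugin -->";
}

// Assumes the filter from the PR exists; outside WordPress nothing is registered.
if ( function_exists( 'add_filter' ) ) {
	add_filter( 'wp_template_output_buffer', 'sketch_append_status_comment' );
}
```

A plugin written this way would no longer need its own ob_start() call or a matching shutdown handler.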
#25
in reply to:
↑ 23
;
follow-ups:
↓ 26
↓ 29
@
3 months ago
Replying to westonruter:
That could work, but some things commonly done by caching plugins wouldn't be supported, like adding an HTML comment at the end of the response.
There's ways to address this, such as providing specific extension points to add HTML comments before or after the output (this would only allow comments, not any HTML as that could easily break the response).
Also, some applications on the output buffer would only need the lighter-weight HTML Tag Processor which doesn't have all of HTML's complicated parsing rules internalized, so such extensions shouldn't be required to use it.
Thanks for clarifying the differences. If we wanted to make this possible, we could run two action hooks, one for each. Then extenders can choose what works best for their purpose, yet still the API wouldn't allow them to go for problematic patterns like regexes.
Going the opposite extreme, other applications may want to load the entire HTML document into the DOM (e.g. the AMP plugin), especially as PHP 8.4's new Dom\HTMLDocument is fully HTML5 compliant, in order to do much more advanced mutations of the document.
Sure, there are always cases for everything - but that doesn't mean they all should be encouraged by the APIs provided by Core.
At the end of the day, plugin developers will do whatever they need to get the job done - whether Core's APIs support it or whether they need to work around it. If we have an API that allows anything, it avoids the need to work around it. But at the same time it's a wildcard where anyone can do whatever they want very easily, like even wipe the entire output.
FWIW I'm just thinking out loud with my above ideas of multiple actions for specific integration points, there may be more elegant solutions. But I think for an API as powerful as this (for both good and bad), we need to have guardrails in place instead of just opening everything up - that sets us up for chaos. For some other APIs being less strict is not so bad, but this can alter the entire HTML output so it's a different level of risk.
I think at the very least, we shouldn't allow filtering the string, but modifications should go through an actual API where WordPress Core retains central control over the output. For example there could be a new class that receives the HTML string and provides methods to modify it (e.g. through one of the HTML tag processor classes or in other ways), and that class instance could be made available through an action.
#26
in reply to:
↑ 25
;
follow-up:
↓ 27
@
3 months ago
First off, thanks for reviving this ticket!
Replying to flixos90:
At the end of the day, plugin developers will do whatever they need to get the job done - whether Core's APIs support it or whether they need to work around it. If we have an API that allows anything, it avoids the need to work around it.
Exactly. Because we don't have to work around it, it would result in a cleaner codebase.
I think at the very least, we shouldn't allow filtering the string, but modifications should go through an actual API where WordPress Core retains central control over the output. For example there could be a new class that receives the HTML string and provides methods to modify it (e.g. through one of the HTML tag processor classes or in other ways), and that class instance could be made available through an action.
Allowing the string to be filtered, would make the lives of us, developers of optimization or slider plugins, easier as there's only one point of entry and therefore, one point of error. Right now, we often need to implement compatibility fixes because one plugin's buffer conflicts with another.
As for the part about using Regex to manipulate the HTML; that's because of the point that @westonruter already mentioned: DOMDocument currently doesn't handle HTML5 properly, and since we need to be backwards compatible back to 7.2 (if we follow WP Core's example) we can't even use PHP 8.4 DOM\HTMLDocument for the next several years (until WP drops support for PHP 8.3). In short, currently using a regex is the most reliable (and faster) way to manipulate HTML.
#27
in reply to:
↑ 26
@
3 months ago
Replying to DaanvandenBergh:
I think at the very least, we shouldn't allow filtering the string, but modifications should go through an actual API where WordPress Core retains central control over the output. For example there could be a new class that receives the HTML string and provides methods to modify it (e.g. through one of the HTML tag processor classes or in other ways), and that class instance could be made available through an action.
Allowing the string to be filtered, would make the lives of us, developers of optimization or slider plugins, easier as there's only one point of entry and therefore, one point of error. Right now, we often need to implement compatibility fixes because one plugin's buffer conflicts with another.
As for the part about using Regex to manipulate the HTML; that's because of the point that @westonruter already mentioned: DOMDocument currently doesn't handle HTML5 properly, and since we need to be backwards compatible back to 7.2 (if we follow WP Core's example) we can't even use PHP 8.4 DOM\HTMLDocument for the next several years (until WP drops support for PHP 8.3). In short, currently using a regex is the most reliable (and faster) way to manipulate HTML.
Regular expressions aren't reliable actually. This is why WP_HTML_Tag_Processor and WP_HTML_Processor were introduced in core as part of the HTML API starting in WP 6.2. I strongly recommend you look at switching. See posts tagged html-api for more details.
If the output buffer is filterable as a string, the filter documentation should heavily discourage the use of regex to parse the output in favor of the HTML API.
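As a point of comparison with regex-based rewriting, a Tag Processor pass over the buffer might look like this minimal sketch. The function name and the choice of adding loading="lazy" are illustrative assumptions; the class_exists() guard makes the function a no-op outside WordPress 6.2+.

```php
<?php
/**
 * Adds loading="lazy" to IMG tags that have no loading attribute (hypothetical helper).
 */
function sketch_add_lazy_loading( string $html ): string {
	if ( ! class_exists( 'WP_HTML_Tag_Processor' ) ) {
		return $html; // HTML API not available; leave the markup untouched.
	}

	$processor = new WP_HTML_Tag_Processor( $html );
	while ( $processor->next_tag( array( 'tag_name' => 'IMG' ) ) ) {
		// get_attribute() returns null when the attribute is absent.
		if ( null === $processor->get_attribute( 'loading' ) ) {
			$processor->set_attribute( 'loading', 'lazy' );
		}
	}

	return $processor->get_updated_html();
}
```

Because the processor tokenizes the document rather than pattern-matching it, attribute values that happen to contain text such as "<img" are not mistaken for tags.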
#28
@
3 months ago
I would certainly be interested in using a core-provided alternative to the output buffer, but I would not want to (have to) switch my entire and "battle-hardened" regex-based codebase to the HTML API to be very honest, in that case I would have to stick with good old ob_* ... :-/
#29
in reply to:
↑ 25
;
follow-up:
↓ 31
@
3 months ago
Replying to flixos90:
Also, some applications on the output buffer would only need the lighter-weight HTML Tag Processor which doesn't have all of HTML's complicated parsing rules internalized, so such extensions shouldn't be required to use it.
Thanks for clarifying the differences. If we wanted to make this possible, we could run two action hooks, one for each. Then extenders can choose what works best for their purpose, yet still the API wouldn't allow them to go for problematic patterns like regexes.
As seen in my examples, Always Load Block Styles on Demand and Always Print Script Modules in Head, certain optimizations don't need the overhead of a tag processor. If, for example, an HTML comment placeholder is printed at wp_head then this can be processed with a simple string replacement (not regex).
There's also the issue of being able to use extended processor subclasses. If core only allowed you to use either WP_HTML_Tag_Processor or WP_HTML_Processor specifically, then if a plugin wanted to instead use a subclass of either then they wouldn't be able to.
I think the output buffer string should be filterable, with documentation that advises against the use of regex, but at the same time doesn't somehow prevent it. If the API is too restrictive, developers will just resort to doing their own output buffering as they are today (as mentioned by you and @OptimizingMatters). WordPress isn't in full control of the output today anyway, and without a central core-supported filter for the output buffer there is extreme fragmentation in how plugins handle output buffer processing. By having a single output buffer and filter, there can be more consistency in how output buffering is handled.
#30
@
3 months ago
I just added a suggestion to my PR that after the wp_template_output_buffer filter has applied there should actually be an action like wp_final_template_output_buffer which fires and is passed the final output buffer string as its argument. This is the action that caching plugins should use to capture the output for storage. It wouldn't be good for caching plugins to rely on the filter to capture the output since there could be another plugin that adds a later filter which changes it somehow, and then there could be a war of action priorities. Using a filter just to capture a value without making any changes also doesn't seem like the right application of filters.
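A caching plugin's read-only capture could then be sketched as follows. The action name is the one suggested above and may change or never ship; the function names, cache group, key scheme, and five-minute lifetime are assumptions for illustration. wp_cache_set() and add_action() are only called when WordPress has defined them.

```php
<?php
/**
 * Builds a cache key for a request URI (hypothetical key scheme).
 */
function sketch_page_cache_key( string $request_uri ): string {
	return 'page_' . md5( $request_uri );
}

/**
 * Stores the final output without modifying it; an action callback returns nothing.
 */
function sketch_store_page( string $buffer ): void {
	$key = sketch_page_cache_key( $_SERVER['REQUEST_URI'] ?? '/' );
	if ( function_exists( 'wp_cache_set' ) ) {
		wp_cache_set( $key, $buffer, 'sketch_page_cache', 300 );
	}
}

// Assumes the suggested action exists; outside WordPress nothing is registered.
if ( function_exists( 'add_action' ) ) {
	add_action( 'wp_final_template_output_buffer', 'sketch_store_page' );
}
```

Since the callback cannot alter the string, its priority relative to other plugins no longer affects what gets stored.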
#31
in reply to:
↑ 29
@
3 months ago
Replying to westonruter:
I think the output buffer string should be filterable, with documentation that advises against the use of regex, but at the same time doesn't somehow prevent it.
This seems like a sensible approach to me.
This ticket was mentioned in Slack in #core by audrasjb. View the logs.
3 months ago
#33
@
3 months ago
- Milestone changed from 6.8 to 6.9
As per today's bug scrub: It appears this ticket is still under discussion. As 6.9 is very close, I'm moving it to 6.9. Feel free to move it back to 6.8 if it can be committed by Monday.
#34
@
3 months ago
Keen for this. There are a ton of scenarios where I need to be able to modify things like HTTP headers and the <head> of a document, based on that document's final content (for SEO, performance, accessibility and various other reasons). That's extremely cumbersome at the moment, given the myriad ways in which (and points in time that) themes, blocks and content can be input, transformed, and output.
Having a reliable, safe way to use output buffering would make developing features in these areas far easier.
Anecdotally, when I was at Yoast, we had a laundry list of powerful block editor SEO features which never got past the drawing board because output buffering is/was nasty at the time. If we can fix that, we can do so much more with blocks.
#36
follow-up:
↓ 37
@
2 months ago
Thanks everyone for pushing this issue forward. As most of you are probably aware, Automattic has generally paused contributions to Core, so I am unable at this time to interact more adequately on this issue. Still, here are some basic thoughts from my end:
We want to be careful that we only provide semantic HTML filtering to HTML outputs. That means excluding the filter from JSON outputs and RSS outputs and XML-RPC/SOAP outputs and any other XML output. There may be ways to more broadly filter HTML content on its way out of WordPress, however, with respect to output buffering I don’t believe the primitives are in place to make this smooth. Likely important is some global $content_type variable indicating the output, as well as new filters in the right places. I’ll come back to this. More broadly Core has what I think is a problem with content provenance of various kinds that are relevant to these designs.
The more I use the HTML API in practice the less concerned I am about relying on the full-blown HTML Processor. This is because it occurs so frequently that we need full HTML parsing that we might as well start with that. In other words, if we end up with two output buffers: a fast Tag Processor pass and a slow HTML Processor pass, then we might as well skip the fast one because we’ll be doing the slow one anyway. If we wanted to, this same process could normalize the HTML leaving the server to provide well-formed documents, though there’s no real need to do this since browsers do anyway. The point is that ten filters on one HTML Processor filter pass is going to be faster than six filters on a Tag Processor and four on an HTML Processor.
For HTML processing I think it’s likely more important to avoid exposing the raw HTML. Some plugins will want this, that’s fine. But Core can likely do a much better job designing an HTML-semantic output buffering pipeline. That is, perhaps Core exposes things like “when reaching IMG tags let me modify its attributes”. I think this is a reasonable place to add a class as the filter so that we can rely on native methods for dispatching the potential extension points; something akin to Python’s HTMLParser class instead of exposing numerous specific filters that take separate functions.
And this brings us back to the content type. If we expose the right filters we won’t have to worry about content since we can run the semantic filters on the full output buffer for HTML-output cases (no need to pass in HTML as a string), but also we can run it on any HTML destined for inclusion inside XML or JSON. My own work has demonstrated that it’s possible for us to reliably convert HTML into XML for things like RSS/Atom feeds where XML is able to express the HTML. This means that these same filters could provide extensibility for non-HTML outputs through an HTML interface. This is going to be a challenge if we go the semantic route, because if we don’t address it then API responses will return different content than page renders, for example.
I would not want to (have to) switch my entire and "battle-hardened" regex-based codebase to the HTML API to be very honest
Your plugin does a lot of HTML stitching and everyone’s invited to do their own thing; stitching is still a developing part of the HTML API. Reliability is not the concern with the HTML API though. Like your plugin, Core is full of examples of “battle hardening,” but these usually cover known patterns and fail in an array of common cases. I will not point out any specific cases, but I saw the same characteristic regex issues in autoptimize as I’ve seen basically everywhere. Regexes are easy, but the HTML API will not mis-parse because it was designed around the spec instead of input examples.
If you get curious, you can subclass the HTML API for more direct control over the kinds of operations you are doing with regexes. The API offers a hierarchy of opt-in risk based on your tolerance for parsing issues and exploits, and it can do way more than it appears, because safety and reliability were the highest design priorities.
#37
in reply to:
↑ 36
@
2 months ago
Replying to dmsnell:
Thanks everyone for pushing this issue forward. As most of you are probably aware, Automattic has generally paused contributions to Core, so I am unable at this time to interact more adequately on this issue. Still, here are some basic thoughts from my end:
Thank you for taking the time!
We want to be careful that we only provide semantic HTML filtering to HTML outputs. That means excluding the filter from JSON outputs and RSS outputs and XML-RPC/SOAP outputs and any other XML output. There may be ways to more broadly filter HTML content on its way out of WordPress, however, with respect to output buffering I don’t believe the primitives are in place to make this smooth. Likely important is some global $content_type variable indicating the output, as well as new filters in the right places. I’ll come back to this. More broadly Core has what I think is a problem with content provenance of various kinds that are relevant to these designs.
I don't believe introducing a global $content_type is necessary because we can look at the Content-Type header that WordPress has sent. For example:
<?php
function od_is_response_html_content_type(): bool {
	$is_html_content_type = false;

	$headers_list = array_merge(
		array( 'Content-Type: ' . ini_get( 'default_mimetype' ) ),
		headers_list()
	);
	foreach ( $headers_list as $header ) {
		$header_parts = preg_split( '/\s*[:;]\s*/', strtolower( $header ) );
		if ( is_array( $header_parts ) && count( $header_parts ) >= 2 && 'content-type' === $header_parts[0] ) {
			$is_html_content_type = in_array( $header_parts[1], array( 'text/html', 'application/xhtml+xml' ), true );
		}
	}

	return $is_html_content_type;
}
In an output buffer, this can be paired with checking for the first non-whitespace character being <:
<?php
// If the content-type is not HTML or the output does not start with '<', then abort since the buffer is definitely not HTML.
if ( ! od_is_response_html_content_type() || ! str_starts_with( ltrim( $buffer ), '<' ) ) {
	return $buffer;
}
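The two checks above can be tried outside of a real response with a small pure function. This is only a sketch: `example_should_process_as_html()` and its parameters are hypothetical names, and it takes the header list as an argument instead of calling `headers_list()` so that the behavior (last Content-Type header wins, parameters such as charset are ignored) is easy to verify.

```php
<?php
/**
 * Hypothetical helper (not part of WordPress or any plugin): decides whether
 * a buffer should be treated as HTML, given the response headers as strings.
 */
function example_should_process_as_html( string $buffer, array $headers, string $default_mimetype = 'text/html' ): bool {
	$is_html = false;

	// Start from the default MIME type; any later Content-Type header overrides it.
	$headers = array_merge( array( 'Content-Type: ' . $default_mimetype ), $headers );

	foreach ( $headers as $header ) {
		// 'Content-Type: text/html; charset=UTF-8' becomes [ 'content-type', 'text/html', 'charset=utf-8' ].
		$parts = preg_split( '/\s*[:;]\s*/', strtolower( $header ) );
		if ( is_array( $parts ) && count( $parts ) >= 2 && 'content-type' === $parts[0] ) {
			$is_html = in_array( $parts[1], array( 'text/html', 'application/xhtml+xml' ), true );
		}
	}

	// Even with an HTML content type, the body must begin with '<' after leading whitespace.
	return $is_html && '<' === substr( ltrim( $buffer ), 0, 1 );
}
```

With this sketch, a JSON response or a plain-text body served under the default content type is rejected, while a normal HTML document with a charset parameter is accepted.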
For HTML processing I think it’s likely more important to avoid exposing the raw HTML. Some plugins will want this, that’s fine. But Core can likely do a much better job designing an HTML-semantic output buffering pipeline. That is, perhaps Core exposes things like “when reaching IMG tags let me modify its attributes”. I think this is a reasonable place to add a class as the filter so that we can rely on native methods for dispatching the potential extension points, something akin to Python’s HTMLParser class, instead of exposing numerous specific filters that take separate functions.
I'd love to see more of what you have in mind here. I know you've advised against passing around instances of WP_HTML_Processor/WP_HTML_Tag_Processor as callbacks for functions, so I understand you're wanting a higher level abstraction that extensions interface with. A couple of the use cases I have are for optimizing PICTURE tags or Embed blocks, both of which require walking over the children. I have a list of other such optimizations built with the HTML Tag Processor.
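To make the "class as the filter" idea from the quote concrete, here is one way such a dispatch could look. It is a sketch under stated assumptions, not anything proposed in the PR: `Example_HTML_Output_Visitor`, `dispatch()` and the `on_*` method convention are all hypothetical, and the parser that would call `dispatch()` for each tag is left out entirely.

```php
<?php
/**
 * Hypothetical base class. The assumption is that Core would walk the document
 * and call dispatch() once per opening tag; none of these names exist in WordPress.
 */
abstract class Example_HTML_Output_Visitor {
	/**
	 * Routes a tag to a method named after it (IMG => on_img) if the subclass defines one.
	 *
	 * @param string $tag_name   Tag name as reported by the parser, e.g. 'IMG'.
	 * @param array  $attributes Attribute name => value map.
	 * @return array Possibly modified attributes.
	 */
	final public function dispatch( string $tag_name, array $attributes ): array {
		$method = 'on_' . strtolower( $tag_name );
		return method_exists( $this, $method ) ? $this->$method( $attributes ) : $attributes;
	}
}

/** Hypothetical extension: lazy-load images that do not already set "loading". */
class Example_Lazy_Image_Visitor extends Example_HTML_Output_Visitor {
	public function on_img( array $attributes ): array {
		if ( ! isset( $attributes['loading'] ) ) {
			$attributes['loading'] = 'lazy';
		}
		return $attributes;
	}
}

$visitor = new Example_Lazy_Image_Visitor();
$img     = $visitor->dispatch( 'IMG', array( 'src' => 'a.png' ) ); // Handled by on_img().
$div     = $visitor->dispatch( 'DIV', array( 'class' => 'wrap' ) ); // No handler, returned unchanged.
```

A per-tag callback like this does not cover the PICTURE and Embed cases mentioned above, since those need access to child elements; any real design would have to expose more context than a single tag's attributes.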
If you get curious, you can subclass the HTML API for more direct control over the kinds of operations you are doing with regexes. The API offers a hierarchy of opt-in risk based on your tolerance for parsing issues and exploits and can do way more than it appears, because safety and reliability were the highest design priorities.
Being able to subclass WP_HTML_Processor would seem to conflict with using a single instance for processing the output buffer. Sure, we could introduce a filter like wp_rest_server_class for allowing plugins to introduce their own subclass for the output buffer processing, but then if multiple plugins each want to use their own subclass, they're out of luck since only one can win.
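The "only one can win" problem with a class-name filter can be shown without WordPress. In this sketch every name is hypothetical: `example_apply_filters()` is a stand-in for a WordPress filter that simply passes a value through callbacks in order, and the three processor classes are empty placeholders.

```php
<?php
// Placeholder classes standing in for a base processor and two plugin subclasses.
class Example_Base_Processor {}
class Example_Plugin_A_Processor extends Example_Base_Processor {}
class Example_Plugin_B_Processor extends Example_Base_Processor {}

/**
 * Stand-in for a class-name filter: each callback receives the current
 * class name and returns the one it wants to be used.
 */
function example_apply_filters( array $callbacks, string $value ): string {
	foreach ( $callbacks as $callback ) {
		$value = $callback( $value );
	}
	return $value;
}

// Two plugins each try to substitute their own subclass.
$callbacks = array(
	static function ( string $class_name ): string {
		return Example_Plugin_A_Processor::class;
	},
	static function ( string $class_name ): string {
		return Example_Plugin_B_Processor::class;
	},
);

// Only a single class name survives the filter, so only one subclass is instantiated.
$class_name = example_apply_filters( $callbacks, Example_Base_Processor::class );
$processor  = new $class_name();
```

Whichever callback runs last decides the class, so plugin A's subclass is silently discarded here.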
#38
@
2 months ago
I've updated the drafted PR to look at the content type for the response, and if it is HTML, then it applies a wp_output_buffer_html filter. (Currently, if the output is not HTML then no filter applies.) By having a dedicated filter just for the HTML response we avoid situations where a template, for example, returns a non-HTML content type (such as in the case of serving robots.txt or feeds), and then a filter corrupts the response assuming it is HTML.
I also added a wp_final_output_buffer action which is passed the final output buffer after filtering, regardless of the content type. This can be used by caching plugins to stash the response for future serving.
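The flow described in this comment can be sketched in plain PHP. This is an illustration of the described behavior only, not the PR's code: `example_finalize_output_buffer()` is a hypothetical function, and plain arrays of callables stand in for the callbacks that would be attached to wp_output_buffer_html and wp_final_output_buffer.

```php
<?php
/**
 * Hypothetical finalizer: HTML-only filters run first (and only for HTML),
 * then every "final" observer sees the result regardless of content type.
 */
function example_finalize_output_buffer( string $buffer, bool $is_html, array $html_filters, array $final_actions ): string {
	if ( $is_html ) {
		foreach ( $html_filters as $filter ) {
			$buffer = (string) $filter( $buffer );
		}
	}
	foreach ( $final_actions as $action ) {
		$action( $buffer );
	}
	return $buffer;
}

// A filter that assumes HTML; it would corrupt robots.txt if applied there.
$append_comment = static function ( string $html ): string {
	return $html . '<!-- filtered -->';
};

// A caching observer that stashes whatever is finally sent.
$cache = array();
$stash = static function ( string $output ) use ( &$cache ): void {
	$cache[] = $output;
};

$html   = example_finalize_output_buffer( '<p>Hi</p>', true, array( $append_comment ), array( $stash ) );
$robots = example_finalize_output_buffer( "User-agent: *\nDisallow: /wp-admin/", false, array( $append_comment ), array( $stash ) );
```

The HTML response is filtered and the robots.txt body passes through untouched, while the caching observer receives both.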
That sounds a lot like treating the symptoms not the cause.