#59544 closed defect (bug) (fixed)
Performance: Skip iteration in block supports elements when not necessary.
Reported by: | dmsnell | Owned by: | flixos90 |
---|---|---|---|
Milestone: | 6.4 | Priority: | normal |
Severity: | normal | Version: | 6.4 |
Component: | Editor | Keywords: | has-patch dev-feedback |
Focuses: | performance | Cc: |
Description
Quick notes
- This is intended to address performance regressions in the 6.4 beta release in combination with the front page of twentytwentyfour. It's half a bug-fix and half an enhancement.
- It needs a backport into Gutenberg because it comes over in a Gutenberg package. I am going to ensure this happens as it needs to, but given the relation to the release, I would prefer we fast-track this into Core first if it finds favorable review and confirmed impact.
Description
Previously there have been three ways in which the block supports elements code has been performing computation it doesn't need to.
- It checks for every possible style attribute on a block even if there are no possible style attributes available on the block.
- Once it determines that it needs to add a special class to a block it continues scanning through the rest of the attributes.
- It constructs dot-separated literal key paths for the attributes and immediately and repeatedly calls `explode()` on them to create an array at runtime.
The combination of these three factors leads to an outsized impact of this function on the runtime and memory pressure for what it's doing. This patch removes all three of these inefficiencies:
- If no elements style attributes exist on the block it immediately returns and skips all further processing.
- It stops scanning block attributes once a single one is found that imposes the requirement for the extra class name.
- It constructs array literals for the key path instead of strings. This removes the need to artificially join and explode the keys at runtime.
There should be no functional or visual changes in this patch, but it should reduce the overall runtime of page renders and require less memory while rendering. The impact will be proportional to the number of blocks rendered on a page.
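To make the three points above concrete, here is a minimal, hypothetical before/after sketch of the pattern being described. The function names and key paths are illustrative only, not the actual `wp_render_elements_support()` code; `_wp_array_get()` is the real Core helper for walking a key path.

```php
<?php
// Illustrative sketch only — names and key paths are hypothetical.

// Before: dot-separated paths exploded at runtime, and every path is still
// checked even after a match has already been found.
function sketch_needs_class_before( array $block_attributes, array $paths ) {
	$needs_class = false;
	foreach ( $paths as $path ) {
		$value = _wp_array_get( $block_attributes, explode( '.', $path ), null );
		if ( null !== $value ) {
			$needs_class = true; // keeps scanning the remaining paths anyway
		}
	}
	return $needs_class;
}

// After: bail out immediately if there are no element styles at all, use
// array-literal key paths, and return as soon as one match is found.
function sketch_needs_class_after( array $block_attributes ) {
	if ( empty( $block_attributes['style']['elements'] ) ) {
		return false; // skip all further processing
	}

	$paths = array(
		array( 'style', 'elements', 'link', 'color', 'text' ),
		array( 'style', 'elements', 'button', 'color', 'text' ),
	);

	foreach ( $paths as $path ) {
		if ( null !== _wp_array_get( $block_attributes, $path, null ) ) {
			return true; // stop scanning on the first match
		}
	}

	return false;
}
```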
Change History (46)
This ticket was mentioned in PR #5411 on WordPress/wordpress-develop by @dmsnell.
15 months ago
#1
This ticket was mentioned in Slack in #core-performance by dmsnell. View the logs.
15 months ago
#4
@
15 months ago
This is intended to address performance regressions in the 6.4 beta release in combination with the front page of
twentytwentyfour
. It's half a bug-fix and half an enhancement.
Does this fix the performance regression noted and being tracked in #59465?
If yes, what part of it should be considered a regression fix? What part of the changes are enhancements that do not directly contribute to the regression fix?
#5
@
15 months ago
There are some profiling results here. See the profile.
This ticket was mentioned in Slack in #core by hellofromtonya. View the logs.
15 months ago
#8
@
15 months ago
@spacedmonkey Aha that ticket - yes thanks for linking it. That's the one I was searching for.
I believe it will help the regression spotted in #59443.
By "help", does it resolve the root cause of the regression noted in #59443? Or is it performance tuning?
Also I added @spacedmonkey @joemcgill and you @spacedmonkey as reviewers for the PR to confirm the changes and performance fix.
#9
@
15 months ago
Adding more context:
Prior to this ticket being opened, it was discussed in the Make/Core Slack #core-performance channel.
#10
@
15 months ago
- Milestone changed from Awaiting Review to 6.4
I'm moving this ticket into the 6.4 milestone for awareness and continued discussion. For now, my understanding is: it either fixes or partially helps with a performance issue introduced in the 6.4 cycle (which is why #59443 was reopened).
#11
@
15 months ago
- Keywords dev-feedback added
Noting: There's an open discussion to better understand #59443.
I'll add `dev-feedback` to denote this ticket is not ready for commit consideration into the 6.4 cycle pending review from @spacedmonkey @flixos90 @joedolson @clarkeemily to advise.
Would also like to see performance benchmarks to confirm it does fix the regression and PR approvals.
@flixos90 commented on PR #5411:
15 months ago
#12
@dmsnell This is a great catch, and IMO worth committing for the 6.4 release.
I ran some additional benchmarks to further validate the benefits outlined by your initial benchmarks in the PR description, based on 100 `benchmark-server-timing` runs. To rule out the changes being caused by variance, I did that multiple times, giving additional confidence in the data. Summary of the median `wp-total` values:
- Against TT4 home page:
- 88.43ms (PR) against 89.73ms (trunk) --> 1.4% faster
- 88.23ms (PR) against 89.93ms (trunk) --> 1.9% faster
- 88.59ms (PR) against 89.89ms (trunk) --> 1.4% faster
- Against TT4 "Sample Page":
- 67.49ms (PR) against 67.6ms (trunk) --> 0.2% faster
- 67.39ms (PR) against 68.2ms (trunk) --> 1.2% faster
- 67.21ms (PR) against 67.91ms (trunk) --> 1.0% faster
- Against TT3 home page:
- 62.91ms (PR) against 64ms (trunk) --> 1.7% faster
- 63.19ms (PR) against 63.68ms (trunk) --> 0.8% faster
- 63.86ms (PR) against 63.98ms (trunk) --> 0.2% faster
Per the above, it is clear that the benefits of this PR are particularly relevant the more complex the current page / blocks are. Even for the very simple "Sample Page" though, there is a net benefit visible, so it's a safe performance improvement. If we were making the decision based on very simple demo content, I'd say this is not _critical_ to merge, but with lots of real-world content more complex than that, I think it's safe to say this makes sense to add, even this late in the cycle.
@dmsnell commented on PR #5411:
15 months ago
#13
Would you be able to add 1-2 unit tests for this function to cover the changes?
If I do this @felixarntz we'll all be waiting longer than anyone wants. I didn't write this function and I have very little knowledge of how to test it or what is important to test. My refactor has centered on code auditing to avoid changing any behaviors except for the runtime performance.
Maybe @jorgefilipecosta or @aaronrobertshaw could chime in on this.
#14
@
15 months ago
- Type changed from enhancement to defect (bug)
As this ticket is intended to fix or partially fix a performance issue introduced in the 6.4 cycle, IMO it falls into the bugfix category rather than enhancement.
Sharing the same caution I did in similar performance fix ticket:
Caution: During beta, there's a fine line between fixing performance and enhancing for refinement. Beta is for fixing issues.
The changes should not introduce new functionality or change the output / result of the enhancement. Rather, it should focus on resolving the performance issue.
#15
@
15 months ago
Correct, @hellofromTonya, and thanks for stating that. This is why I've tried to limit the scope of the patch (discussed in the PR) so that its change is as trim and auditable as possible. I'm currently creating additional requested changes, but I prefer we go with what's in the patch without the additional changes.
between fixing performance and enhancing for refinement
For context, what led me here is the performance issues with the combination of the beta release and the twentytwentyfour home page, specifically a render with lots of blocks. Without removing those blocks and changing twentytwentyfour (which in a sense would hide the performance issues we're discussing) I started looking to see if there were parts of the code that specifically make renders with lots of content slow, and this function both stood out and was doing obviously needless work.
I still haven't seen clear evidence that any one thing is responsible for the performance issues other than that Core is slow when rendering as much complex content as twentytwentyfour does, as I think we're seeing the exacerbation of many different performance problems that individually aren't major. This patch is specifically cutting one of those exacerbations in the hot path, which so far seems to be the actual `render_block()` portion in my experiments and profiling.
This ticket was mentioned in Slack in #core-performance by dmsnell. View the logs.
15 months ago
@dmsnell commented on PR #5411:
15 months ago
#17
All, I have pushed a new commit to this branch which separates the question of whether a block needs the extra class name from the act of adding it. This new function incorporates all of the further practical optimizations.
- it avoids creating arrays, except for the static list of heading elements that I thought was reasonable enough.
- it aborts as soon as it can to avoid computing any needless lookups
As this function diverges a lot from the existing code I'm still leery of the change in this release cycle, and I'm a bit skeptical of the additional performance impact, so I would appreciate several people testing it out to see if it's worth it. This change, in my opinion, makes it a little more fragile for future additions and changes to the rules because they are now encoded more in code structure than in a data list by itself. I have no foot in the game, though, so this is just me sharing my experience on the matter and not a personal wish.
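For illustration, a hypothetical shape of such a predicate, assuming a split between "does this block need the class?" and the code that applies it; the key names checked here are made up and this is not the actual branch contents:

```php
<?php
// Hypothetical sketch: a pure predicate that allocates no arrays except the
// static heading list and returns as early as possible.
function sketch_block_has_elements_styles( array $attributes ) {
	$elements = isset( $attributes['style']['elements'] )
		? $attributes['style']['elements']
		: null;

	if ( empty( $elements ) ) {
		return false;
	}

	if ( isset( $elements['link']['color']['text'] ) ) {
		return true;
	}

	static $headings = array( 'h1', 'h2', 'h3', 'h4', 'h5', 'h6' );
	foreach ( $headings as $heading ) {
		if ( isset( $elements[ $heading ]['color'] ) ) {
			return true;
		}
	}

	return false;
}

// The caller would then only touch the markup when the predicate says so;
// adding the class to $block_content itself is unchanged from existing code.
```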
@dmsnell commented on PR #5411:
15 months ago
#18
I've updated the description with some initial benchmarking against the extra commit. While I think these are all fairly close, I think the addition of the function call made the change slower. I'll inline it, but I think these results are matching my intuition, that the performance gains were made and the rest is in the blurry boundary.
@dmsnell commented on PR #5411:
15 months ago
#19
Funny: I inlined the function and the performance results got worse. Optimization is hard on modern CPUs. Maybe some speculation was winning an advantage with the `array()` structure somehow. I'm not sure. Or maybe the JIT found a better path.
As it stands this PR is faster at the first commit and slower at the second. So please, maybe we just stick with the initial proposal which was both a smaller and clearer change but also probably faster in practice?
@flixos90 commented on PR #5411:
14 months ago
#20
@dmsnell How did you verify that the performance got worse? If you used the `benchmark-server-timing` command, I'm not sure it's suitable for this kind of optimization. There is a considerable amount of variance, even with 100+ runs. It's good to rely on when the difference is substantial enough and consistently leaning in the one direction, but I am pretty sure that none of the additionally proposed improvements have any impact that would bring visible results with that command.
That doesn't mean those improvements are incorrect. For example, computing the class name from the function outside the loop surely is more efficient than for each entry in the loop, even if it doesn't show in benchmarks. All of the additionally proposed points of feedback are micro optimizations, which we probably just need to individually reason about. While in an ideal world we could even spot those with `benchmark-server-timing`, the variance in that command is too high to help with micro optimizations.
Anyway, all that is to say: I'm not disagreeing with your conclusion, but I think we need to be mindful of not interpreting too much into `benchmark-server-timing` results when it comes to tiny differences (<1%). In those cases, I would recommend running the command several times, and if the tendency is always in the same direction, there's a good chance there really is an improvement (or a regression, in the inverse case). But otherwise, the benchmarks simply don't help in that case.
@flixos90 commented on PR #5411:
14 months ago
#21
@dmsnell
Would you be able to add 1-2 unit tests for this function to cover the changes?
If I do this @felixarntz we'll all be waiting longer than anyone wants. I didn't write this function and I have very little knowledge of how to test it or what is important to test. My refactor has centered on code auditing to avoid changing any behaviors except for the runtime performance.
I'm not sure that's a reasonable path forward. The fact that this refactor aims to avoid changing any behaviors is precisely why it should have test coverage, to ensure that that is the case, and remains so in the future. Yes, writing tests takes extra time, and I get that it sometimes feels like it slows things down. But it's crucial in the WordPress core project that code has test coverage.
Having test coverage also makes committing changes less risky if the committer can rely on existing test coverage passing as another indicator that a change is good to go. I think this function can reasonably be tested with PHPUnit, and it probably should have had tests added already whenever it was introduced. But since we're changing it now, it's crucial we at least follow up with that now.
@dmsnell commented on PR #5411:
14 months ago
#22
How did you verify that the performance got worse? If you used the benchmark-server-timing command, I'm not sure it's suitable for this kind of optimization.
This is indeed how I measured, with the specific commands and environment documented above in the description. I ran with 3000 iterations and have seen fairly consistent results each time I run it.
My point isn't that this does or doesn't have micro optimizations, but that if it's this difficult to tell whether it helps, hurts, or makes no difference, then it's not worth the complexity and readability issues that it requires. I'm given to believe that something underpins the fact that it appears that these additional micro-optimizations slow things down, strange as it seems.
Regardless, the changes introduced in the first commit reliably appear in various benchmarks and measurements, so those are a clear win without a clear readability penalty. I generally prefer to get _better_ out the door before we can confirm if _best_ is better.
computing the class name from the function outside the loop surely is more efficient than for each entry in the loop, even if it doesn't show in benchmarks.
Also noted in the thread above but this isn't getting computed multiple times. As soon as that part of the loop is reached the function returns, so this is only ever being called once. There was an initial misunderstanding of the code that led to this suggestion, but it doesn't apply here.
I'm not sure that's a reasonable path forward.
Pardon for the misunderstanding! I'm not saying I won't add tests; and unfortunately, unless someone comes in to help, it may be _unreasonable_, but it's the only thing I can offer.
Also worth mentioning: if I add tests someone is going to have to review the tests to make sure I'm testing the right things. _Quis custodiet ipsos custodes?_ I'm flying blind here. I'm not sure I can get these in this week, or even by end of next week. I'll try.
@aaronrobertshaw are you at all available to help speed this up?
@flixos90 commented on PR #5411:
14 months ago
#23
Thanks for the detailed clarification @dmsnell!
@aaronrobertshaw If you can assist here with test coverage, that would be tremendously helpful. I am happy to review, will have to see if alternatively I can take a stab at it as well.
@dmsnell commented on PR #5411:
14 months ago
#24
@felixarntz still relying on the `wpp-research` tooling to measure, I set up the tests on an isolated server in a climate-controlled data center, sharing no resources with other systems. the results agree with my local testing.
- 2.0 GHz EPYC Milan, 16 GB
npm run research -- benchmark-server-timing -u 'http://localhost:3880' -n 5000 -c 30 -p -o csv
- PHP and mysql are both running natively, outside of any container
version | wp-total (p50) | % against trunk |
---|---|---|
trunk | 124.9 | - |
first commit only | 119.0 | -4.73% |
with micro-optimizations | 121.2 | -2.96% |
I haven't listed the other numbers because they correspond with the results shared in the description. I'm happy to see other results show otherwise, but I think something was lost in my attempts to realize the additional suggestions. these results each stem from 5000 requests and are relatively consistent when running multiple times.
@aaronrobertshaw commented on PR #5411:
14 months ago
#25
Thanks for the pings!
I'm happy to help wherever I can. Just catching up on the changes and discussion before looking into tests.
@aaronrobertshaw commented on PR #5411:
14 months ago
#26
Quick update:
I'm working on some additional tests and I'll put together PRs for both core and Gutenberg.
The plan is to leave the existing `wp_render_elements_support` tests untouched for the moment and simply add coverage to ensure that the correct styles are being generated for elements via the style engine.
It'll take a little bit to get this sorted but it'll happen in the next couple of hours.
@dmsnell commented on PR #5411:
14 months ago
#27
Quick note here @felixarntz: in further testing on my server I'm getting results all over the place, which I wasn't earlier. I'm not sure what's going on, unless the server is just more variable than my laptop. Maybe I can make my own benchmark with `curl` and build a more statistically sound analysis.
Any suggestions otherwise? I would have hoped that over thousands of runs this would converge, but it would also be nice to see a histogram of the results which maybe tells a clearer picture.
Needless to say though, we're clearly doing less work by aborting early, so I don't doubt any of the existing conclusions. Maybe the additional optimizations are more effective than I think, or maybe they aren't. I'm sure there's no point in continuing to scan attributes though when we've already determined we need to add the class, and `explode()` was already creating arrays for the paths, and even in the first commit those arrays still exist (just as literals), so it's strictly equal or better in those terms.
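As a concrete illustration of the explode-versus-literal point (the key names below are made up, not the actual attribute paths):

```php
<?php
// Illustration only; key names are hypothetical.

// Runtime join + explode: builds a string and then immediately splits it back
// into an array on every call.
$path = explode( '.', 'style.elements.link.color.text' );

// Array literal: the same path expressed directly, with no string handling
// at runtime.
$path = array( 'style', 'elements', 'link', 'color', 'text' );
```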
@dmsnell commented on PR #5411:
14 months ago
#28
and some plots:
first up is a raw scatter plot of 3000 runs each of `Server-Timing` based on manually extracting the values, still on that server mentioned above. outliers are cut out of the graph, but they do not significantly change the distribution. it's interesting to note just how patterned the variation is.
I used `node` to quickly extract the timings from each file.
for a different perspective, here are the runs for each branch when sorted.
not much but mystery I guess
`ggbetweenstats` thinks there's a significant difference though, both from `trunk` and with the extra optimizations.
branch | standard deviation |
---|---|
trunk | 28.1 |
first commit | 10.2 |
all commits | 6.67 |
strangely this is showing a performance regression in all this work, but I still think there's bias slipping in on the server. I will see if I can re-run this overnight while alternating branches to account for any systematic issues with the server; maybe it's not as isolated as I have thought.
@aaronrobertshaw commented on PR #5411:
14 months ago
#29
There's a PR up in https://github.com/WordPress/wordpress-develop/pull/5418 adding elements block support unit tests. The equivalent Gutenberg PR (https://github.com/WordPress/gutenberg/pull/55113) has been merged there.
@dmsnell commented on PR #5411:
14 months ago
#30
really splitting hairs now, but here is the same plot based on around 11,000 points for each commit. every 500 samples I switched branches to spread out pattern bias in the server.
this keeps on confirming I guess what we've already concluded: the first commit brings a significant but small improvement over `trunk`, while the additional changes, surprising as they are, for some reason regress from the first commit but are still faster than `trunk`.
@dmsnell commented on PR #5411:
14 months ago
#31
Last night was fun: I have a few more charts to share, and I wouldn't have continued on this except I feel like it will be useful to have some better testing infrastructure for myself for other work and other measurements. On the aforementioned server and also on an old Core i9 MBP, 64 GB, I set up the same experiment and collected 50,000 data points for each of the three commits in question: `trunk` at the merge base, 26f3d589, and a805b240. I made sure to manually control the fan on the MBP to ensure that it didn't thermally throttle (which it did early on before I set the fan, and then I had to restart the experiment because the numbers shot straight up).
What did I find? More of the same, really, but a surprise with the laptop.
(charts omitted: raw data and statistical comparison plots for both the server and the laptop)
what's really interesting to me is the strongly bi-modal distribution of request timings on the laptop. I'm not sure why that is, but I think it might explain somewhat how at times it can seem like our performance results can flip in different test runs. I haven't confirmed if this behavior is present on my Apple silicon laptop yet, so it could be something related to the PC hardware and architecture. I think the final piece I'd like to figure out, and this is something I _don't_ plan on doing for this PR, is to run these experiments on some free shared WordPress hosts if I can so that I can obtain a better grasp of how some of these performance changes impact the "everyman's WordPress hosting" - the kinds of hosts that are much more typical for WordPress than high-end developer machines or dedicated servers. sometimes I wonder if the changes we make help the best environments at the cost of those installs running on spinning disks, shared caches, and strained databases.
overall though this adds nothing beyond the earlier experiments demonstrating that the first commit alone is not only a much smaller patch, but faster than the additional changes. my recommendation remains therefore that we ought to consider merging just the first commit and leave any potential micro-optimization for the normal release cycle. that being said I'm rerunning the experiment with @spacedmonkey's latest suggestion to inline the array of heading elements. happy to take further recommendations, but also happy to take the easy win and leave the muddy parts for a better time.
@flixos90 commented on PR #5411:
14 months ago
#32
Thank you for this super detailed performance research @dmsnell, it's indeed interesting (and confusing!) to see those results.
Unfortunately, there is still _a lot_ for us to uncover and improve regarding the performance benchmarking tooling, and at the moment the variance is still relatively high. For Server-Timing, it generally produces quite consistent results for me - but when I say consistent, that is still within a 1-2% margin of error/variance when comparing medians from multiple `benchmark-server-timing` calls.
That's why for now I use those benchmarks mostly as an indicator for broader changes to see whether they make a notable improvement and, if so, how notable it is.
For micro optimizations (basically anything that came after the initial commit), I don't trust it at all, quite frankly. For those optimizations, I think we'll still have to operate on general performance knowledge, discussion, etc. I think if in doubt over a micro optimization, our best bet is to use something like https://3v4l.org/, run an operation using both of the potential approaches, let's say 100,000 times, and look at the difference to make an educated call. I would advise relying on that instead of `benchmark-server-timing` when considering micro optimizations.
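A rough sketch of the kind of throwaway micro-benchmark being suggested here; the two compared operations are stand-ins, not the actual patch code:

```php
<?php
// Throwaway micro-benchmark sketch; the compared operations are stand-ins.

$iterations  = 100000;
$path_string = 'style.elements.link.color.text';

$start = hrtime( true );
for ( $i = 0; $i < $iterations; $i++ ) {
	$path = explode( '.', $path_string );
}
$explode_ns = hrtime( true ) - $start;

$start = hrtime( true );
for ( $i = 0; $i < $iterations; $i++ ) {
	$path = array( 'style', 'elements', 'link', 'color', 'text' );
}
$literal_ns = hrtime( true ) - $start;

printf(
	"explode(): %.2f ms, array literal: %.2f ms\n",
	$explode_ns / 1e6,
	$literal_ns / 1e6
);
```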
@spacedmonkey commented on PR #5411:
14 months ago
#33
I think we need the following missing unit tests (see the sketch after this list).
- Unit test for empty content passed in `$block_content`.
- Unit test for a `$block` with an empty `'attrs'`.
- Unit test for a block that is not registered.
- Unit tests for buttons, headings and links.
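A minimal sketch of what such coverage might look like, assuming the existing `wp_render_elements_support( $block_content, $block )` signature; the test class, block markup, and expected pass-through behavior for the unregistered case are assumptions, not the committed tests:

```php
<?php
// Hypothetical sketch of the requested coverage; not the committed tests.
class Tests_Block_Supports_Elements_Sketch extends WP_UnitTestCase {

	public function test_empty_block_content_is_returned_unchanged() {
		$block = array( 'blockName' => 'core/paragraph', 'attrs' => array() );
		$this->assertSame( '', wp_render_elements_support( '', $block ) );
	}

	public function test_block_with_empty_attrs_is_left_untouched() {
		$content = '<p>Hello</p>';
		$block   = array( 'blockName' => 'core/paragraph', 'attrs' => array() );
		$this->assertSame( $content, wp_render_elements_support( $content, $block ) );
	}

	public function test_unregistered_block_is_left_untouched() {
		// Assumption: with no style attributes, the content passes through unchanged.
		$content = '<div>Hello</div>';
		$block   = array( 'blockName' => 'fake/not-registered', 'attrs' => array() );
		$this->assertSame( $content, wp_render_elements_support( $content, $block ) );
	}

	public function test_link_color_adds_elements_class() {
		$content = '<p>Hello <a href="#">world</a></p>';
		$block   = array(
			'blockName' => 'core/paragraph',
			'attrs'     => array(
				'style' => array(
					'elements' => array(
						'link' => array( 'color' => array( 'text' => '#fff000' ) ),
					),
				),
			),
		);
		$this->assertStringContainsString(
			'wp-elements-',
			wp_render_elements_support( $content, $block )
		);
	}
}
```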
@dmsnell commented on PR #5411:
14 months ago
#34
I think we need the following missing unit tests.
I can try and get these added when I bring over the tests that @aaronrobertshaw added. thanks again @aaronrobertshaw!
`static`
removing the `static` isn't appearing to show any benefit over the first commit either.
if in doubt over a micro optimization, our best bet is to use something like https://3v4l.org/
we can do that @felixarntz but I would also doubt results from micro-benchmarks like that. it's too easy to construct synthetic benchmarks that remove the important context, such as the code that runs around this stuff. obviously if an impact is huge it won't matter, but then it also won't require creating an artificial test scenario that may not exist in production to prove a significant improvement.
run an operation using both of the potential approach let's say 100,000 times, and look at the difference to make an educated call. I would advise to rely on that instead of benchmark-server-timing when considering micro optimizations.
well I ran these 10,000 times, and while the distributions overlap there's also strong statistical evidence that the differences are not attributable to random variation alone.
for the micro-optimizations I would be happy to see evidence that shows they improve things, but I'm trusting my measurements on this over my intuition. to that end I think it's best if I focus on porting the tests into a Core ticket and let y'all do with this what you want because there are bigger performance penalties to chase. so far, it seems like _all_ of the evidence we _have_ collected pushes one way, and the dubious hypothetical benefits come at a cost of more drastic code changes (and this is even a slightly different scenario than it would be if all that extra complexity demonstrated in any way that it was even marginally faster, which it isn't). my language might appear strong here, but I don't have any strong feelings on where you take this. I'm only sharing my experience and experiments and trying to be clear about where my recommendations are coming from.
at the moment the variance is still relatively high
curious to hear more of your thoughts on why you doubt the tooling, or why you doubt it after running statistical sampling on it. N = 10000 is generally considered quite incredible, and in review of the raw data it looks fair between the branches, and the pattern bias that was shown to impact the results in the short term was spread out among the branches when I identified it and changed the experiment.
I'd love to have a formalized way to compare the performance of the full site across branches so I welcome your input on how to improve the setup.
@flixos90 commented on PR #5411:
14 months ago
#35
FWIW I am happy with pretty much "whichever" version of this PR being committed, since almost all the feedback consists of performance nit-picks not worth blocking this.
I would suggest though that we wait until #5418 has been committed first, so that there is additional test coverage that helps verify that this change is "safe to make".
@spacedmonkey commented on PR #5411:
14 months ago
#36
For micro optimisations like this, profiling is much better than benchmarking to see the effect.
@dmsnell commented on PR #5411:
14 months ago
#37
previous PR HEAD was b5207cb99c595c44a9db78e9dad9acf497b75a32
resetting to first commit only based on following the data
For micro optimisations like this, profiling is much better than benchmarking to see the effect.
I guess this is a good conversation for a different place, but if a test of a realistic environment conflicts with profiling why would we trust the profiling? we know that profiling itself impacts the runtime and can bias results, particularly any results where blocking time is a big part of the run. sampling profilers definitely help, but they still skew results.
all I really want to do is not waste time hunting for a micro-optimization that nobody can demonstrate is better or worse. that's such a precarious path to take.
I would suggest though that we wait until https://github.com/WordPress/wordpress-develop/pull/5418 has been committed first, so that there is additional test coverage that helps verify that this change is "safe to make".
sounds great! I didn't realize we already had a Core ticket for those unit tests either. I'll turn this PR over to either of you, or whoever wants to push it through. in the meantime I'll try and follow the tests PR and see if I can add the additional requested tests.
do you need anything else from me on this?
@flixos90 commented on PR #5411:
14 months ago
#39
Just merged latest `trunk` into here in https://github.com/WordPress/wordpress-develop/pull/5411/commits/70e91de436dd75ca0f810b79e07cd288948463e2, which has the tests added from https://core.trac.wordpress.org/ticket/59555
@flixos90 commented on PR #5411:
14 months ago
#41
Committed in https://core.trac.wordpress.org/changeset/56807
This ticket was mentioned in PR #5446 on WordPress/wordpress-develop by @dmsnell.
14 months ago
#42
- Keywords has-unit-tests added
Follow-up to #59544
Adds additional unit tests to confirm the behavior of elements block supports when certain conditions aren't met.
Trac ticket:
#43
in reply to: description
@
14 months ago
- Keywords has-unit-tests removed
additional unit tests added in #59578. those tests raise a behavioral question which I have answered in that patch by modifying the behavior, but I'm happy to change the tests so that they encode the existing behavior.
@dmsnell commented on PR #5411:
14 months ago
#44
Thanks @felixarntz - I added additional unit tests in #5446
@dmsnell commented on PR #5446:
14 months ago
#45
thanks again @aaronrobertshaw - I made it easy and simply adopted your patch. we can see if @felixarntz and @spacedmonkey are happy with this, as I was simply trying to follow up with their requirements from last week. if they are, I don't feel like I need any props on this patch since, although I offered some tests of my own flavor, this patch is neither motivated by my own work nor the product thereof.
@flixos90 commented on PR #5446:
14 months ago
#46
Committed in https://core.trac.wordpress.org/changeset/56828