Make WordPress Core

Opened 12 years ago

Closed 11 years ago

#21914 closed enhancement (fixed)

Improve pingback page block parsing

Reported by: otto42's profile Otto42 Owned by:
Milestone: 3.5 Priority: normal
Severity: normal Version: 3.4.2
Component: Pings/Trackbacks Keywords: has-patch commit
Focuses: Cc:

Description

The code in pingback_ping which reads the remote page and parses the page into a suitable comment for the pingback needs some improvement with the paragraph parse.

This line of code converts blocks into paragraph separators, which are later split up and handled independently:

$linea = preg_replace( "/ <(h1|h2|h3|h4|h5|h6|p|th|td|li|dt|dd|pre|caption|input|textarea|button|body)[^>]*>/", "\n\n", $linea );

Problem is that it's only considering blocks with spaces at the beginning of them, and not also taking closing blocks into consideration as well.

This minor improvement produces better results, I think.

$linea = preg_replace( "/<\/*(h1|h2|h3|h4|h5|h6|p|th|td|li|dt|dd|pre|caption|input|textarea|button|body)[^>]*>/", "\n\n", $linea );

Change is to eliminate the required space in front of the < and to also allow the closing blocks to mark separation of paragraphs.

This tends to give much cleaner results, especially when the last paragraph in the content is the one that contains the pingback link (pretty commonplace). This happens because by allowing closing blocks to separate the paragraphs, stuff added after them (using the_content filters) doesn't get added into that last paragraph as well. Almost all content has a /p at the end of it, and filters that add to the_content area usually use div's or span's to add to it, not p's.

Patch attached for trunk.

Attachments (2)

pingback_ping.diff (791 bytes) - added by Otto42 12 years ago.
minor enhancement to the pingback parsing
test.php (11.2 KB) - added by Otto42 11 years ago.
Demo file

Download all attachments as: .zip

Change History (8)

@Otto42
12 years ago

minor enhancement to the pingback parsing

#1 @nacin
11 years ago

  • Keywords needs-unit-tests added
  • Milestone changed from Awaiting Review to 3.5

#2 @nacin
11 years ago

This is not sanely testable currently. At the very least, some examples would be great.

@Otto42
11 years ago

Demo file

#3 @Otto42
11 years ago

I know this is hard to test, so I attached a demo file. Basically, I pulled out most of the filtering process (the relevant bits) into a function with both methods (1 and 2 = before and after). I also included a big chunk of content from one of my latest blog posts, but with the addition of a link at the end, to simulate the link in a pingback. Then I show the before and after results. Run it on the command line to get the difference.

The result looks like this:

Before:
[...]to comment about anything I missed, or what you see most often, especially if you&#8217;re doing translations yourself. Related posts:Internationalization: You&#8217;re probably doing it[...]

After:
[...]to comment about anything I missed, or what you see most often, especially if you&#8217;re doing translations[...]

Essentially, the first method (what we have now) is excessively sensitive to HTML spacing. If you have compressed HTML, without a lot of newlines, then it tends to bunch things together. When links are at the end of the content (where it's likely to be for things like credit links), then plugins which stick random sharing HTML and such at the end of the content can be perceived as part of the paragraph, and come out in the resulting excerpt.

By eliminating the space requirement, and allowing ending /p's and other blocks to delimit paragraphs as well, the result is sometimes shorter, but definitely cleaner. Most of the time, the results are basically the same when the link is higher in the body. This mainly helps when the link is down towards the end of the content.

#4 @nacin
11 years ago

  • Keywords commit added; needs-unit-tests removed

Okay, I'm convinced.

#5 @nacin
11 years ago

In [22152]:

Improve pingback text extraction by stopping at a closing block-level tag. props Otto42. see #21914.

#6 @ryan
11 years ago

  • Resolution set to fixed
  • Status changed from new to closed
Note: See TracTickets for help on using tickets.