Make WordPress Core

Opened 16 years ago

Last modified 5 years ago

#5678 accepted enhancement

Respectfully strip newlines in some importers

Reported by: jdub's profile jdub Owned by: hansengel's profile hansengel
Milestone: WordPress.org Priority: normal
Severity: normal Version: 2.5
Component: Import Keywords: needs-patch needs-refresh
Focuses: Cc:

Description

Filing this as an enhancement because it could do with some discussion and insight from wiser and more experienced heads before being labelled "defect". :-)

I noticed while helping some users import their blogs that importers of HTML content (such as the RSS importer) don't tidy up superfluous newlines in the import format, which results in unnecessary <br/> elements after wpautop() filtering for display. They turn up in the editor too, which reinforces the problem.

I've adapted one of the filter functions to strip superfluous newlines, and changed my RSS importer to use it. The results have been warmly welcomed by users, who no longer have to clean up their imported blog content. ;-)

strip_newlines() should probably go into wp-includes/formatting.php, if there isn't already a function that already serves this purpose. I couldn't find one, so I adapted this.

Given that similar HTML block/inline-savvy string-replacement code exists in other formatting functions, perhaps there's an opportunity for some refactoring here? I feel kind of silly proposing a function that is almost entirely duplicated from other code in the core.

I've used it immediately before the "Clean up content" section in wp-admin/import/rss.php's get_posts(), and in an Advogato importer that I've written (which also uses HTML as the content format).

function strip_newlines($text) {
	// Respectfully strip unnecessary newlines
	$textarr = preg_split("/(<[^>]+>)/Us", $text, -1, PREG_SPLIT_DELIM_CAPTURE);
	$stop = count($textarr); $skip = false; $output = ''; // loop stuff
	for ($ci = 0; $ci < $stop; $ci++) {
		$curl = $textarr[$ci];
		if (! $skip && isset($curl{0}) && '<' != $curl{0}) { // If it's not a tag
			$curl = preg_replace('/[\n\r]+/', ' ', $curl);
		} elseif (strpos($curl, '<code') !== false || strpos($curl, '<pre') !== false || strpos($curl, '<kbd') !== false || strpos($curl, '<style') !== false || strpos($curl, '<script') !== false) {
			$next = false;
		} else {
			$next = true;
		}
		$output .= $curl;
	}
	return $output;
}

Thoughts?

Attachments (3)

5678.r6625.diff (1.5 KB) - added by hansengel 16 years ago.
Fixes #5678 in r6625 (credit jdub if this is committed, please)
5678.r6631.diff (1.8 KB) - added by hansengel 16 years ago.
Newer patch—adds PHPDoc and renames function to strip_html_newlines.
5678.r6633.diff (1.4 KB) - added by hansengel 16 years ago.
An even-newer patch—simplifies strip_html_newlines()

Download all attachments as: .zip

Change History (21)

#1 @hansengel
16 years ago

  • Milestone changed from 2.6 to 2.5
  • Owner changed from anonymous to hansengel
  • Status changed from new to assigned
  • Version set to 2.5

+1. I'll get working on a patch.

@hansengel
16 years ago

Fixes #5678 in r6625 (credit jdub if this is committed, please)

#2 @hansengel
16 years ago

  • Keywords has-patch tested added

#3 @lloydbudd
16 years ago

+1, I've encountered this issues a few times when importing customers blogs from Movable Type and TypePad.

#4 @mdawaffe
16 years ago

This seems like a good idea, but the function looks overly complicated. $skip and $next are never actually used.

I'd also appreciate a comment for the function saying that it's intended for fully marked up, valid HTML only. The function can't be trusted to strip lines from unencoded text, pre-wpautop half HTML/half text, or invalid HTML. That's fine, it should just be noted.

#5 @jdub
16 years ago

Oh, I didn't think this would go from proposal to trunk-ready quite so quickly -- thanks.

LiveJournal is another importer for which this will be useful.

mdawaffe is absolutely right that the function should only be used for fully marked up HTML, and needs to include a comment to that effect. Perhaps it should be renamed to 'strip_html_newlines' or something?

Luckily, for the proposed use case, that covers a large proportion of import formats. :-)

#6 @lloydbudd
16 years ago

  • Keywords needs-patch added; has-patch tested removed

#7 @westi
16 years ago

  • Keywords needs-doc added

'strip_html_newlines' is a much better name - and don't forget to write some phpdoc for this new function please :-)

@hansengel
16 years ago

Newer patch—adds PHPDoc and renames function to strip_html_newlines.

#8 follow-up: @hansengel
16 years ago

  • Keywords has-patch has-docs added; needs-patch needs-doc removed

#9 in reply to: ↑ 8 @lloydbudd
16 years ago

  • Keywords needs-patch added; has-patch removed

hansengel, the new patch suffers from the same problems as mdawffe identfied above "the function looks overly complicated. $skip and $next are never actually used."

@hansengel
16 years ago

An even-newer patch—simplifies strip_html_newlines()

#10 @jdub
16 years ago

Unfortunately, you've just simplified it slightly too much. :-)

The element whitespace smarts was pretty crucial to why I wrote the function, given that the input can look like this:

<p>This is a paragraph in which the
wpautop() <a href="#elsewhere">wrapping will
happen in the middle</a> of an
anchor, which is an inline element.</p>

<pre>
But at the same time, we wouldn't want
the newlines to be removed from within
a block element such as a <pre>.
</pre>

With element whitespace smarts, the newlines in the middle of the <p> will be removed (so as not to be replaced with a <br/> during wpautop(), but those within the <pre> will not.

I'll write up another simplified patch that keeps this (and will also include using this function in other HTML-consuming importers).

Thanks. :-)

#11 @ffemtcj
16 years ago

  • Milestone changed from 2.5 to 2.6

#12 @Denis-de-Bernardy
15 years ago

  • Component changed from General to Import

#13 @Denis-de-Bernardy
15 years ago

  • Milestone changed from 2.9 to Future Release

#14 follow-up: @RyanMurphy
14 years ago

Since the importers are now plugins, wontfix?

#15 in reply to: ↑ 14 @nacin
14 years ago

Replying to RyanMurphy:

Since the importers are now plugins, wontfix?

Not necessarily. They are core plugins so we still manage them as an extension of core.

#16 @SergeyBiryukov
12 years ago

  • Milestone changed from Future Release to WordPress.org

#17 @chriscct7
9 years ago

  • Keywords needs-refresh added

#18 @DrewAPicture
9 years ago

  • Keywords has-docs removed
Note: See TracTickets for help on using tickets.