Make WordPress Core

Opened 14 years ago

Closed 13 years ago

Last modified 13 years ago

#11144 closed defect (bug) (fixed)

WP importer strips whitespace at EOL, problem with Markdown

Reported by: demetris's profile demetris Owned by:
Milestone: 3.1 Priority: normal
Severity: normal Version: 2.9
Component: Import Keywords: has-patch
Focuses: Cc:

Description

Markdown does not convert line breaks to (X)HTML break tags. Instead, when you want a line break, you end a line with two or more spaces.

The WP WXR exporter retains those spaces at ends of lines. However, when you import the file to WP, the spaces are removed.

Can this be fixed, or is there some reason it should stay as it is now?

Thanks for reading!

Attachments (7)

wp_11144_import_xml.png (6.8 KB) - added by dwright 14 years ago.
the xml to import M are linebreaks
wp_11144.png (42.9 KB) - added by dwright 14 years ago.
html view of the imported linebreaks (shown as whitespace)
wp_11144_b.png (19.5 KB) - added by dwright 14 years ago.
html view of the imported linebreaks cursor is at end of linebreaks
wp-t11144-visual-explanation.png (39.0 KB) - added by demetris 14 years ago.
Visual explanation of the issue
op109.net-2010-03-02-user1.xml (3.8 KB) - added by demetris 14 years ago.
WRX file with post/page that has extra spaces before EOLs
wordpress.php.2.patch (1.4 KB) - added by dwright 14 years ago.
maintain linefeeds and spaces, only for content:encoded tags, pages or posts WP import
wordpress.php.patch (1.4 KB) - added by dwright 14 years ago.
maintain linefeeds and spaces, only for content:encoded tags, pages or posts WP import

Download all attachments as: .zip

Change History (30)

#1 @scribu
14 years ago

  • Keywords reporter-feedback added

when you want a line break, you end a line with two or more spaces.

Don't you mean two or more line breaks?

#2 @demetris
14 years ago

No. Spaces. Like this:

I want an HTML line-break at the end of this line[SPACE][SPACE][LF]

This is standard in Markdown, and the only way to insert an HTML line break. The only implementation that changes it is the Github flavour. (I suppose they changed it because it did not suit well their particular usage, but, in general, the default Markdown thing, treating line breaks as just spaces, is more useful.)

#3 @Denis-de-Bernardy
14 years ago

  • Keywords needs-patch added; reporter-feedback removed

should be fixed imo

#4 @demetris
14 years ago

The issue seems to be related to an rtrim() (unsurprisingly) — in particular, the first of the two instances of rtrim() in the importer:

http://core.trac.wordpress.org/browser/trunk/wp-admin/import/wordpress.php

I have no idea what that line does and why it uses rtrim() (this stuff is too complex for me). Removing the rtrim() from there makes the importer preserve the whitespace at EOLs, but it has a side-effect too: all LFs are duplicated. (So, too much white space, and no line breaks AT ALL any more, either with Markdown or with wpautop().)

That’s as far as my investigation can go. :-)

#5 @hakre
14 years ago

demetris, just for clarification: can you confirm the problem with markdown / whitespace at the end of lines for the current trunk?

#6 @hakre
14 years ago

  • Keywords reporter-feedback added

#7 @demetris
14 years ago

  • Keywords reporter-feedback removed

@hakre:

Yep. Just checked with the current trunk and the problem still exists.

BTW, one does not need Markdown to verify the issue. Just export a post that has some lines with whitespace at the end, and then import it: The whitespace is preserved in the XML file, but is stripped during the import.

#9 @dwright
14 years ago

  • Keywords has-patch added; needs-patch removed

#10 @dwright
14 years ago

  • Cc david_v_wright@… added

#11 @demetris
14 years ago

  • Keywords needs-patch added; has-patch removed

Thanks for the patch, dwright.

I tested it with current trunk (r13304) but the behaviour remains the same: the spaces at EOLs are there in the XML file but are missing after importing.

@dwright
14 years ago

the xml to import M are linebreaks

@dwright
14 years ago

html view of the imported linebreaks (shown as whitespace)

@dwright
14 years ago

html view of the imported linebreaks cursor is at end of linebreaks

#12 @dwright
14 years ago

I tested it with current trunk (r13304) but the behaviour remains the same: the spaces at EOLs are there in the XML file but are missing after importing.

I guess I don't understand what the issue is then.

see attached screenshots showing the imported newlines/whitespace

(the whitespace is killed in visual mode but preserved in html with included patch)

can you maybe send an image of what your after?

#13 follow-up: @demetris
14 years ago

Thanks for the reply, dwright.

So, in your tests, importing does not affect strings of spaces before EOLs.

While in my tests those strings of spaces are completely removed. (I also verified the issue by looking at the post_content in the DB entry.)

The only difference of environments I can think of is the Visual Editor, which I always disable in my user settings. But should this have any effect on the process of importing?

I attach a visual explanation of what happens whenever I import, and why this is a problem when you Markdown is used.

@demetris
14 years ago

Visual explanation of the issue

#14 in reply to: ↑ 13 @dwright
14 years ago

Replying to demetris:

Thanks for the reply, dwright.

So, in your tests, importing does not affect strings of spaces before EOLs.

While in my tests those strings of spaces are completely removed. (I also verified the issue by looking at the post_content in the DB entry.)

So you are saying it looks like it's working as desired in my environment and not in yours?

If so, are you sure the patch applied?

Take this session:

post as originally written:

mysql>  select post_content from wp_posts where ID=6;
+------------------------+
| post_content           |
+------------------------+
| test 



          | 
+------------------------+

export xml file.
post deleted. (and trash emptied)
xml file imported (without patch)

mysql>  select post_content from wp_posts where ID=6;
+--------------+
| post_content |
+--------------+
| test         | 
+--------------+
1 row in set (0.00 sec)

This is the behaviour I believe you are seeing, correct?

Now, apply the patch:

[dwright:wordpress]$ patch wp-admin/import/wordpress.php < wp-admin/import/wordpress.php.patch
patching file wp-admin/import/wordpress.php
[dwright:wordpress]$ svn diff
Index: wp-admin/import/wordpress.php
===================================================================
--- wp-admin/import/wordpress.php	(revision 13450)
+++ wp-admin/import/wordpress.php	(working copy)
@@ -56,7 +56,8 @@
 		preg_match("|<$tag.*?>(.*?)</$tag>|is", $string, $return);
 		if ( isset($return[1]) ) {
 			$return = preg_replace('|^<!\[CDATA\[(.*)\]\]>$|s', '$1', $return[1]);
-			$return = $wpdb->escape( trim( $return ) );
+			if ($tag == 'content:encoded') $return = $wpdb->escape( trim( $return, "\t\0\x0B" ) );
+			else $return = $wpdb->escape( trim( $return ) );
 		} else {
 			$return = '';
 		}

delete post ID=6; empty trash; restart webserver and import the same xml file.

mysql>  select post_content from wp_posts where ID=6;
+-------------------+
| post_content      |
+-------------------+
| test



          | 
+-------------------+
1 row in set (0.00 sec)

This is the result I believe you want correct?

#15 @Denis-de-Bernardy
14 years ago

why not simply change:

$return = $wpdb->escape( trim( $return ) ); 

into:

$return = $wpdb->escape( $return ); 

there isn't any reason to trim anything here.

#16 follow-up: @demetris
14 years ago

dwright, I tried again but the problem remains.

I imported my sample post/page into two different WP installations, after making sure that the patch had been applied correctly. I’m attaching the WRX file, if anyone else wants to give it a quick try, to make sure I’m not doing something wrong.

@demetris
14 years ago

WRX file with post/page that has extra spaces before EOLs

#17 in reply to: ↑ 16 ; follow-up: @dwright
14 years ago

I see, thanks for the attachment demetris.

My tests were importing 'posts', not 'pages'. (it's a different process code-wise)

Here is a patch which will work for either. It's not the greatest patch but is very conservative, only effecting '<content:encoded>'.

Denis, possibly that is true, I wanted to keep the patch as conservative as possible, so I left in trim (removing spaces and newline's for content only).

In this new patch (for page imports, the 'rtrim($importline, "\n\r\0\x0B");' is crucial. RE: '$this->post .= $importline . "\n";' affix's a newline to content. (so if rtrim is removed, the content would have double line breaks, as demetris originally found out.)

Note, I didn't run the WP test suite.

@dwright
14 years ago

maintain linefeeds and spaces, only for content:encoded tags, pages or posts WP import

@dwright
14 years ago

maintain linefeeds and spaces, only for content:encoded tags, pages or posts WP import

#18 @dwright
14 years ago

  • Keywords has-patch added; needs-patch removed

#19 in reply to: ↑ 17 @demetris
14 years ago

Replying to dwright:

I see, thanks for the attachment demetris.

My tests were importing 'posts', not 'pages'. (it's a different process code-wise)

Here is a patch which will work for either. It's not the greatest patch but is very conservative, only effecting '<content:encoded>'.

For some reason I was only testing with Pages and it had not occured to me that the process might be different. :-|

The new patch fixes the issue for both Posts and Pages.

Thank you for your time and I apologize for the confusion.

#20 @nacin
14 years ago

  • Milestone changed from 3.0 to Future Release

#21 @nacin
13 years ago

  • Resolution set to fixed
  • Status changed from new to closed

(In [15961]) Importer and exporter overhaul, mega props duck.

Exporter overhaul:

  • Add author information to export
  • Greater usage of slug identifiers
  • Don't export auto-drafts, spam comments, or edit lock/last meta keys
  • Inline documentation improvements
  • Remove filtering for now (@todo)
  • Bump WXR version to 1.1, but remain back compat in the importer

Importer overhaul (http://plugins.trac.wordpress.org/changeset/304249):

  • Use an XML parser where available (SimpleXML, XML Parser)
  • Proper import support for navigation menus
  • Many bug fixes, specifically improvements to category and custom taxonomy handling
  • Better author/user mapping

Fixes #5447 #5460 #7400 #7973 #8471 #9237 #10319 #11118 #11144 #11354 #11574 #12685 #13364 #13394 #13453 #13454 #13627 #14306 #14442 #14524 #14750 #15055 #15091 #15108.

See #15197.

#22 @nacin
13 years ago

  • Milestone changed from Future Release to 3.1

#23 @demetris
13 years ago

Just a note to confirm that the overhaul fixed this — just tested with the latest importer, version 0.3 beta 4.

Thanks duck!

Note: See TracTickets for help on using tickets.