Make WordPress Core

Opened 17 years ago

Closed 14 years ago

Last modified 14 years ago

#5460 closed defect (bug) (fixed)

WXR importer doesn't like XML files with </item><item>

Reported by: jeremyvisser's profile JeremyVisser Owned by: westi's profile westi
Milestone: 3.1 Priority: low
Severity: normal Version: 2.8.4
Component: Import Keywords: needs-patch
Focuses: Cc:

Description

I'm trying to import a WXR file that contains the following:

</item><item>

...in between different blog entries. However, WordPress keels over, and doesn't import the file properly.

Getting a text editor and manually find-and-replace'ing all occurrences of </item><item> with:

</item>
<item>

...works.

The main problem is that WordPress doesn't actually use an XML parser to import stuff. However, I set the ticket severity to "minor", as WordPress itself doesn't generate WXR files that contain </item><item> — it's only when using a particular third-party generator that this happens. However, it's WordPress' fault that it's failing, as there should really be no difference between the above two examples.

Attachments (2)

wordpress.xml (2.4 KB) - added by JeremyVisser 17 years ago.
Hand-crafted WXR file to showcase the bug.
5460_get_entries.diff (1.4 KB) - added by hailin 16 years ago.
patch

Download all attachments as: .zip

Change History (20)

#1 @westi
17 years ago

  • Owner changed from anonymous to westi
  • Status changed from new to assigned

Please attach an example of this file that WordPress cannot parse.

@JeremyVisser
17 years ago

Hand-crafted WXR file to showcase the bug.

#2 @JeremyVisser
17 years ago

If you try and import that file into WordPress, only the post at the bottom of the file, Hello World, is imported. If you then open the .xml file in an editor, and insert a linebreak between all the </item><item>s, the other posts will import successfully.

#3 @westi
17 years ago

Looking at the current code this may not be easy to fix without switching to a streaming xml parser.

@hailin
16 years ago

patch

#4 @hailin
16 years ago

This is caused by the line-by-line parsing nature of the wordpress import code, and the assumption that only <item> or </item> can appear on a single line.

The fix handles the case when both <item> and </item> appear on a single line.

Tested it with the xml input file the user provided, as well as regular standard xml input file. It works correctly now.

#5 @hailin
16 years ago

I think the current import code can be improved substantially IF we utilize an XML parser such as SimpleXMLElement, which is available in PHP 5.x.

Current parsing is line-by-line, and it calls get_entries() three times, resulting in three times of line-by-line parsing of the whole input file.

Besides, the assumption that <item> or <wp:category>(.*?)</wp:category>, or <wp:tag>(.*?)</wp:tag>, will always appear in one single line is fragile. They may well appear in different lines.

With an XML parser, we can fix all these potential issues, and improve the import speed.

#6 @Denis-de-Bernardy
15 years ago

aye. but it would require php5. :-)

when are we switching WP to php5 anyway?

#7 @Denis-de-Bernardy
15 years ago

  • Component changed from General to Import

#8 @joostdevalk
15 years ago

  • Keywords has-patch tested commit added

There is a patch here, that is just PHP4, why not get it committed? :)

Just tested this on latest Subversion (rev 11498)

#9 @Denis-de-Bernardy
15 years ago

still applies clean, in case there is any interest.

#10 @benz001
15 years ago

  • Cc benz001 added
  • Version changed from 2.3.1 to 2.8.4

Yes there is still interest!
I've stumbled across this importing a hand-rolled minimal XML file in 2.8.4, in my case the <item> nodes were in separate lines, but no other line breaks existed - hence the import failed eg

<item><category><![CDATA[surfaces]]></category><category domain="category" nicename="surfaces"><![CDATA[surfaces]]></category><title>Stainless Steel Surface (236 x 114cm) -1</title><content:encoded><![CDATA[<img href="wp-content/uploads/old/surfaces/IMG_6324.jpg" /><span>Stainless Steele Surface (236 x 114cm) <i>$80</i></span>]]></content:encoded></item>

FAIL

<item>
	<category><![CDATA[surfaces]]></category>
	<category domain="category" nicename="surfaces"><![CDATA[surfaces]]></category>
	<title>Grey Laminate Surface with stainless steel sides (246 x 60cm)</title>
	<content:encoded><![CDATA[<img href="wp-content/uploads/old/surfaces/IMG_6326.jpg" /><span>Grey Laminate Surfacewith stainless steele sides (246 x 60cm) <i>$80</i></span>]]></content:encoded>
</item>

PASS

You can imagine how surprised I was at having to worry about whitespace in an XML file.
Does the patch listed above cover this example?

#11 follow-up: @benz001
15 years ago

Just answered my question, the diff file above clearly doesn't as it handles the item tag as a special case. Rats.

Any plans to switch to proper XML parsing?

#12 @benz001
15 years ago

  • Severity changed from minor to normal

And just found that it also dies with too many linebreaks eg if you pass the XML through a formatter like eclipsetidy

        <item>
            <category>
                <![CDATA[surfaces]]>
            </category>
            <category domain="category" nicename="surfaces">
                <![CDATA[surfaces]]>
            </category>
            <title>
                Antique zinc surface (100x170x 2cm) -1
            </title>
            <content:encoded>
                <![CDATA[<img href='images/100_1439.jpg' title='click'><span>Antique zinc surface (100x170x 2cm)<i>$50</i></span>]]>
            </content:encoded>
        </item>

In this case the item does import, but the post content includes the <![CDATA markup and so doesn't show in the post except in HTML view, it appears that everything inside the <content:encoded> tag is put into the post content, not everything within the CDATA section.

The following does work

        <item>
            <category>
                <![CDATA[surfaces]]>
            </category>
            <category domain="category" nicename="surfaces">
                <![CDATA[surfaces]]>
            </category>
            <title>
                Antique zinc surface (100x170x 2cm) -1
            </title>
            <content:encoded><![CDATA[<img href='images/100_1439.jpg' title='click'><span>Antique zinc surface (100x170x 2cm)<i>$50</i></span>]]></content:encoded>
        </item>

Note that the linebreak within category prior to the category CDATA doesn't stop the categories importing properly, its just the post content that stuffs up.

This really does make using the wordpress import for external conversion tools/any kind of custom post population quite painful.

#13 in reply to: ↑ 11 ; follow-up: @westi
15 years ago

Replying to benz001:

Just answered my question, the diff file above clearly doesn't as it handles the item tag as a special case. Rats.

Any plans to switch to proper XML parsing?

Possibly. Anything we have needs to support PHP4 and PHP5 which makes XML Parsing not fun :-)

But if someone wrote a good PHP5 XML Parser based WXR importer maybe it could be included and used if PHP5 available.

#14 in reply to: ↑ 13 @JeremyVisser
15 years ago

Replying to westi:

Possibly. Anything we have needs to support PHP4 and PHP5 which makes XML Parsing not fun :-)

But if someone wrote a good PHP5 XML Parser based WXR importer maybe it could be included and used if PHP5 available.

Mmm...that sounds agreeable with me. Would an importer written with SimpleXML be acceptable, or would it have to be a little more complex?

#15 @westi
15 years ago

  • Keywords needs-patch added; has-patch tested commit removed
  • Milestone changed from 2.9 to Future Release

Move to Future Release for now.

I think we would do better to use an XML parser and have some backcompat for PHP4

#16 @nacin
14 years ago

  • Resolution set to fixed
  • Status changed from accepted to closed

(In [15961]) Importer and exporter overhaul, mega props duck.

Exporter overhaul:

  • Add author information to export
  • Greater usage of slug identifiers
  • Don't export auto-drafts, spam comments, or edit lock/last meta keys
  • Inline documentation improvements
  • Remove filtering for now (@todo)
  • Bump WXR version to 1.1, but remain back compat in the importer

Importer overhaul (http://plugins.trac.wordpress.org/changeset/304249):

  • Use an XML parser where available (SimpleXML, XML Parser)
  • Proper import support for navigation menus
  • Many bug fixes, specifically improvements to category and custom taxonomy handling
  • Better author/user mapping

Fixes #5447 #5460 #7400 #7973 #8471 #9237 #10319 #11118 #11144 #11354 #11574 #12685 #13364 #13394 #13453 #13454 #13627 #14306 #14442 #14524 #14750 #15055 #15091 #15108.

See #15197.

#17 @JeremyVisser
14 years ago

Awesome work, dude! Talk about killing two, I mean, twenty-four birds with one stone!

#18 @nacin
14 years ago

  • Milestone changed from Future Release to 3.1
Note: See TracTickets for help on using tickets.