#5460 closed defect (bug) (fixed)
WXR importer doesn't like XML files with </item><item>
Reported by: | JeremyVisser | Owned by: | westi |
---|---|---|---|
Milestone: | 3.1 | Priority: | low |
Severity: | normal | Version: | 2.8.4 |
Component: | Import | Keywords: | needs-patch |
Focuses: | Cc: |
Description
I'm trying to import a WXR file that contains the following:
</item><item>
...in between different blog entries. However, WordPress keels over, and doesn't import the file properly.
Getting a text editor and manually find-and-replace'ing all occurrences of </item><item>
with:
</item> <item>
...works.
The main problem is that WordPress doesn't actually use an XML parser to import stuff. However, I set the ticket severity to "minor", as WordPress itself doesn't generate WXR files that contain </item><item>
— it's only when using a particular third-party generator that this happens. However, it's WordPress' fault that it's failing, as there should really be no difference between the above two examples.
Attachments (2)
Change History (20)
#2
@
17 years ago
If you try and import that file into WordPress, only the post at the bottom of the file, Hello World, is imported. If you then open the .xml
file in an editor, and insert a linebreak between all the </item><item>
s, the other posts will import successfully.
#3
@
17 years ago
Looking at the current code this may not be easy to fix without switching to a streaming xml parser.
#4
@
16 years ago
This is caused by the line-by-line parsing nature of the wordpress import code, and the assumption that only <item> or </item> can appear on a single line.
The fix handles the case when both <item> and </item> appear on a single line.
Tested it with the xml input file the user provided, as well as regular standard xml input file. It works correctly now.
#5
@
16 years ago
I think the current import code can be improved substantially IF we utilize an XML parser such as SimpleXMLElement, which is available in PHP 5.x.
Current parsing is line-by-line, and it calls get_entries() three times, resulting in three times of line-by-line parsing of the whole input file.
Besides, the assumption that <item> or <wp:category>(.*?)</wp:category>, or <wp:tag>(.*?)</wp:tag>, will always appear in one single line is fragile. They may well appear in different lines.
With an XML parser, we can fix all these potential issues, and improve the import speed.
#8
@
15 years ago
- Keywords has-patch tested commit added
There is a patch here, that is just PHP4, why not get it committed? :)
Just tested this on latest Subversion (rev 11498)
#10
@
15 years ago
- Cc benz001 added
- Version changed from 2.3.1 to 2.8.4
Yes there is still interest!
I've stumbled across this importing a hand-rolled minimal XML file in 2.8.4, in my case the <item> nodes were in separate lines, but no other line breaks existed - hence the import failed eg
<item><category><![CDATA[surfaces]]></category><category domain="category" nicename="surfaces"><![CDATA[surfaces]]></category><title>Stainless Steel Surface (236 x 114cm) -1</title><content:encoded><![CDATA[<img href="wp-content/uploads/old/surfaces/IMG_6324.jpg" /><span>Stainless Steele Surface (236 x 114cm) <i>$80</i></span>]]></content:encoded></item>
FAIL
<item> <category><![CDATA[surfaces]]></category> <category domain="category" nicename="surfaces"><![CDATA[surfaces]]></category> <title>Grey Laminate Surface with stainless steel sides (246 x 60cm)</title> <content:encoded><![CDATA[<img href="wp-content/uploads/old/surfaces/IMG_6326.jpg" /><span>Grey Laminate Surfacewith stainless steele sides (246 x 60cm) <i>$80</i></span>]]></content:encoded> </item>
PASS
You can imagine how surprised I was at having to worry about whitespace in an XML file.
Does the patch listed above cover this example?
#11
follow-up:
↓ 13
@
15 years ago
Just answered my question, the diff file above clearly doesn't as it handles the item tag as a special case. Rats.
Any plans to switch to proper XML parsing?
#12
@
15 years ago
- Severity changed from minor to normal
And just found that it also dies with too many linebreaks eg if you pass the XML through a formatter like eclipsetidy
<item> <category> <![CDATA[surfaces]]> </category> <category domain="category" nicename="surfaces"> <![CDATA[surfaces]]> </category> <title> Antique zinc surface (100x170x 2cm) -1 </title> <content:encoded> <![CDATA[<img href='images/100_1439.jpg' title='click'><span>Antique zinc surface (100x170x 2cm)<i>$50</i></span>]]> </content:encoded> </item>
In this case the item does import, but the post content includes the <![CDATA markup and so doesn't show in the post except in HTML view, it appears that everything inside the <content:encoded> tag is put into the post content, not everything within the CDATA section.
The following does work
<item> <category> <![CDATA[surfaces]]> </category> <category domain="category" nicename="surfaces"> <![CDATA[surfaces]]> </category> <title> Antique zinc surface (100x170x 2cm) -1 </title> <content:encoded><![CDATA[<img href='images/100_1439.jpg' title='click'><span>Antique zinc surface (100x170x 2cm)<i>$50</i></span>]]></content:encoded> </item>
Note that the linebreak within category prior to the category CDATA doesn't stop the categories importing properly, its just the post content that stuffs up.
This really does make using the wordpress import for external conversion tools/any kind of custom post population quite painful.
#13
in reply to:
↑ 11
;
follow-up:
↓ 14
@
15 years ago
Replying to benz001:
Just answered my question, the diff file above clearly doesn't as it handles the item tag as a special case. Rats.
Any plans to switch to proper XML parsing?
Possibly. Anything we have needs to support PHP4 and PHP5 which makes XML Parsing not fun :-)
But if someone wrote a good PHP5 XML Parser based WXR importer maybe it could be included and used if PHP5 available.
#14
in reply to:
↑ 13
@
15 years ago
Replying to westi:
Possibly. Anything we have needs to support PHP4 and PHP5 which makes XML Parsing not fun :-)
But if someone wrote a good PHP5 XML Parser based WXR importer maybe it could be included and used if PHP5 available.
Mmm...that sounds agreeable with me. Would an importer written with SimpleXML be acceptable, or would it have to be a little more complex?
#15
@
15 years ago
- Keywords needs-patch added; has-patch tested commit removed
- Milestone changed from 2.9 to Future Release
Move to Future Release for now.
I think we would do better to use an XML parser and have some backcompat for PHP4
#16
@
14 years ago
- Resolution set to fixed
- Status changed from accepted to closed
(In [15961]) Importer and exporter overhaul, mega props duck.
Exporter overhaul:
- Add author information to export
- Greater usage of slug identifiers
- Don't export auto-drafts, spam comments, or edit lock/last meta keys
- Inline documentation improvements
- Remove filtering for now (@todo)
- Bump WXR version to 1.1, but remain back compat in the importer
Importer overhaul (http://plugins.trac.wordpress.org/changeset/304249):
- Use an XML parser where available (SimpleXML, XML Parser)
- Proper import support for navigation menus
- Many bug fixes, specifically improvements to category and custom taxonomy handling
- Better author/user mapping
Fixes #5447 #5460 #7400 #7973 #8471 #9237 #10319 #11118 #11144 #11354 #11574 #12685 #13364 #13394 #13453 #13454 #13627 #14306 #14442 #14524 #14750 #15055 #15091 #15108.
See #15197.
Please attach an example of this file that WordPress cannot parse.