Opened 16 years ago
Closed 16 years ago
#8999 closed task (blessed) (fixed)
Completely New LiveJournal Importer
Reported by: | beaulebens | Owned by: | |
---|---|---|---|
Milestone: | 2.8 | Priority: | normal |
Severity: | normal | Version: | 2.7 |
Component: | Import | Keywords: | needs-testing has-patch |
Focuses: | Cc: |
Description
Attached is a diff that, when applied to a recent revision on trunk (developed against r10375), will replace the current LiveJournal importer with a new version.
This one uses LiveJournal's API to import posts and comments, and has been tested handling over 3,500 posts and 190,000 comments. The data handling is split into AJAX-driven steps to allow it to handle bigger imports.
It also tries to load as much of the metadata on LJ posts as possible into custom fields/postmeta in WP, and translates lj-cut into <!--more--> and lj-user into real links.
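For anyone curious what that markup translation involves, here's a rough, hypothetical sketch of the kind of conversion being described (not the actual patch code; the exact tag variants LJ allows and the resulting link format are assumptions):
<?php
// Hypothetical sketch of the lj-cut / lj-user conversion, not the patch itself.
function lj_convert_markup( $post_content ) {
	// An opening <lj-cut> (with or without text="...") becomes WordPress's "more" tag.
	$post_content = preg_replace( '|<lj-cut[^>]*>|i', '<!--more-->', $post_content );

	// Closing </lj-cut> tags have no WP equivalent, so drop them.
	$post_content = str_ireplace( '</lj-cut>', '', $post_content );

	// <lj user="someuser"> becomes a real link to that user's journal.
	$post_content = preg_replace(
		'|<lj\s+user="?([\w-]+)"?\s*/?>|i',
		'<a href="http://$1.livejournal.com/">$1</a>',
		$post_content
	);

	return $post_content;
}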
It'd be great to get some more people to try this out and see if it's "core-worthy".
Attachments (3)
Change History (51)
#3
@
16 years ago
I tested the importer against my Livejournal.
The import itself went well, no issues. I haven't checked every single entry but it seems to have got the entries and comments effectively.
One small issue: it seems to struggle with single quotes in the comments - it's actually showing the entity code rather than the quote, for example
it&apos;s
Within the articles themselves the single quotes are being handled correctly.
#4
@
16 years ago
In terms of possible improvements: maybe an option to control the author assigned to the journal entries and to the journal owner's comments, and the ability to restrict the import by date. But as a one-off it's great. :)
#5
@
16 years ago
Thanks for testing it out mrmist.
Can you please have a look at the source code of one of your comments and see what it looks like where it's doing this double-escaping on quotes? I can see what you've given as an example there in the source, but in the actual output I see the correct thing (an apostrophe).
By the sounds of it, the actual source code of your comments will be something like:
it&amp;apos;s
And a couple of questions -- Are you running any other plugins or any other modifications of any kind on the install you're testing on? What version of PHP are you using? What exactly does the code look like on LiveJournal's end for the same comment?
Thanks for helping with this!
#6
@
16 years ago
Hmm, OK, I am somewhat confused by this now, because it's rendering them correctly in Opera but not in IE7. It's almost like IE is not translating the entity. Could be an issue local to me.
The original livejournal comment's source includes the quote mark as a quote mark.
The imported comment's source includes the quote mark as &apos;
In livejournal entry text, the quote marks in the source are shown as quote marks.
In imported entry text, the quote marks in the source are shown as the entity ’
That's tested from my local test svn which runs off IIS7, php5, default theme and doesn't run any plugins.
I'll do a quick check on my externally hosted svn site too.
#7
@
16 years ago
Unrelated to the other minor issue, my other svn test site runs with debug on and you get quite a few notices that you might want to tidy up -
Notice: ob_flush() [ref.outcontrol]: failed to flush buffer. No buffer to flush. in /home/www/svn/trunk/wp-admin/import/livejournal.php on line 864
Notice: Undefined index: security in /home/www/svn/trunk/wp-admin/import/livejournal.php on line 322
Notice: Undefined index: _ajax_nonce in /home/www/svn/trunk/wp-includes/pluggable.php on line 783
...
Notice: Undefined offset: 1 in /home/www/svn/trunk/wp-admin/import/livejournal.php on line 579
Notice: Undefined offset: 1 in /home/www/svn/trunk/wp-admin/import/livejournal.php on line 599
#8
@
16 years ago
Entity blindness, I think. It should be &#039; rather than &apos; - &apos; is XML.
(It's the same on my other site.)
#10
@
16 years ago
It looks like you're right. I checked it out and there are references to other people having a similar problem, e.g. http://fishbowl.pastiche.org/2003/07/01/the_curse_of_apos/
So basically IE actually does the correct thing (shock, horror!) and other browsers are more lenient in this case.
I think this is coming from the ::unhtmlentities() method that I have, which was copy-pasted from another importer, so we might need to propagate this fix across the others once we nail it down. It looks like there's an additional parameter that can be used on get_html_translation_table() to alter how quotes are handled so I'll play with it.
#11
@
16 years ago
Try changing line 238 (first line of function unhtmlentities()) to this:
$trans_tbl = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES);
and see if that fixes it for you. That's giving me "clean" single-quotes (not encoded as entities at all, but then neither is the data that's coming from LJ). They appear correctly in FF (I don't have IE to test on here).
@
16 years ago
Drops ::unhtmlentities() entirely and uses html_entity_decode() to avoid single quote problems. Tested in FF and IE6 and looks good.
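For anyone following along, here's a paraphrased sketch of the two approaches (not the exact patch code - the old helper is reconstructed from how the other importers' copy reads):
<?php
// Paraphrased sketch of the importer's old helper. With no quote-style flag,
// get_html_translation_table() defaults to ENT_COMPAT (double quotes only),
// so single-quote entities pass through untouched.
function unhtmlentities( $string ) {
	$trans_tbl = get_html_translation_table( HTML_ENTITIES );
	$trans_tbl = array_flip( $trans_tbl );
	return strtr( $string, $trans_tbl );
}

// The fix suggested in the comment above: include single quotes in the table.
// $trans_tbl = get_html_translation_table( HTML_ENTITIES, ENT_QUOTES );

// What the new attachment does instead: drop the helper and decode directly.
$clean = html_entity_decode( "it&#039;s", ENT_QUOTES ); // "it's"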
#14
@
16 years ago
Writing var_export()s to files is rather scary. I'd rather not have an importer dependent on being able to write to the filesystem.
#18
follow-up:
↓ 26
@
16 years ago
A couple points:
- Most of the importers write to the filesystem in some way or another (upload files etc)
- I'm open to another way of doing it, but there's no way to handle big imports from the LJ API that I could come up with other than files since the data is very likely to exceed memory limits (especially when you have to juggle multiple copies of things to re-thread comments, which aren't threaded when you receive them)
- I agree the var_export thing is kinda scary, but it performed significantly better than serialize/unserialize (and assuming that it's necessary to store things between steps, something needs to be used to maintain array structure)
- I'm aware of the Snoopy issue, and am working on adding cookie support to the HTTP API right now.
I'm totally open to other ideas on how this can work, but that's all I could come up with after a lot of messing around with other options. Big blogs (lots of comments in particular) are just massively problematic if you try to do anything much in memory here.
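For context, the file-based staging under discussion looks conceptually like this (a rough sketch only - the file name and array keys are made up, and this is exactly the part being reconsidered):
<?php
// Rough sketch of the file-based staging being discussed (the approach being
// reconsidered here). File name and data keys are made up for illustration.
$file = WP_CONTENT_DIR . '/lj-comments-stage.php';

if ( ! file_exists( $file ) )
	file_put_contents( $file, "<?php\n" );

// Each AJAX-driven step appends its batch as executable PHP, so the array
// structure survives between requests without holding everything in memory.
$comment_batch = array( 'id' => 123, 'parentid' => 0, 'body' => 'example' );
file_put_contents(
	$file,
	'$lj_comments[] = ' . var_export( $comment_batch, true ) . ";\n",
	FILE_APPEND
);

// A later step pulls the staged data back in and re-threads it - this is the
// include that ryan flags as the main concern.
$lj_comments = array();
include $file;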
#19
follow-up:
↓ 20
@
16 years ago
Some comments:
Is there a LJ export file we can import from?
Does it have to use a HTTP API?
Does the API support iteration?
#20
in reply to:
↑ 19
;
follow-up:
↓ 21
@
16 years ago
Replying to westi:
Is there a LJ export file we can import from?
Sort of (that's what the old one used) - if you're OK with individually exporting per month, and then manually importing each file :-/
Does it have to use a HTTP API?
Only known option that gives access to all posts + comments.
Does the API support iteration?
Yes, which is what it uses, but with the comments in particular, you need to "collect them all up" so that you can re-thread them, because they come across the wire in "last edited" order, so they're not very structured (other than their LJ DB ids)
#21
in reply to:
↑ 20
;
follow-up:
↓ 22
@
16 years ago
Replying to beaulebens:
Replying to westi:
Does the API support iteration?
Yes, which is what it uses, but with the comments in particular, you need to "collect them all up" so that you can re-thread them, because they come across the wire in "last edited" order, so they're not very structured (other than their LJ DB ids)
Ok.
Can we not iterate through each comment, add it to the relevant post, and then post-process all the comments on a post to rebuild the hierarchy?
#22
in reply to:
↑ 21
@
16 years ago
Replying to westi:
Ok.
Can we not iterate through each comment, add it to the relevant post, and then post-process all the comments on a post to rebuild the hierarchy?
That's what I was going to do originally, but without some sort of commentmeta (to store references to their original parent-ids) I couldn't figure out how to do that. Unless we did something sketchy like storing it in one of the other comment fields or something?
#26
in reply to:
↑ 18
;
follow-up:
↓ 27
@
16 years ago
Replying to beaulebens:
A couple points:
- Most of the importers write to the filesystem in some way or another (upload files etc)
They upload one file that goes through all of the various hooks and checks, and receive attachment IDs. I don't know of any importers that write to the filesystem outside of that.
- I'm open to another way of doing it, but there's no way to handle big imports from the LJ API that I could come up with other than files since the data is very likely to exceed memory limits (especially when you have to juggle multiple copies of things to re-thread comments, which aren't threaded when you receive them)
- I agree the var_export thing is kinda scary, but it performed significantly better than serialize/unserialize (and assuming that it's necessary to store things between steps, something needs to be used to maintain array structure)
serialize is scary too. The main concern is doing a PHP include of this stuff later. Can it not go through the options DB, where we have some checks for proper serialization, at least, and where we don't have to do includes?
- I'm aware of the Snoopy issue, and am working on adding cookie support to the HTTP API right now.
Okay.
#27
in reply to:
↑ 26
;
follow-ups:
↓ 28
↓ 29
@
16 years ago
Replying to ryan:
They upload one file that goes through all of the various hooks and checks, and receive attachment IDs. I don't know of any importers that write to the filesystem outside of that.
True. The first time these files are created, they do use wp_upload_bits() (trying to play nice), then they are overwritten in place after that. I agree it's definitely not ideal tho.
serialize is scary too. The main concern is doing a PHP include of this stuff later. Can it not go through the options DB, where we have some checks for proper serialization, at least, and where we don't have to do includes?
I ran into memory problems every time I tried collecting things in the DB. The (capacity) test journal that I was using has nearly 200,000 comments, so that adds up pretty quickly and maxes out MySQL's max query size.
Each individual comment could perhaps be stored separately in wp_options with an option_name of something like lj_comment_xxx, but then you wouldn't be able to easily/selectively query them back out to rebuild threads.
I'm just pondering westi's idea a little more, to see if there's a way the comments could be inserted into wp_comments, using the other fields (perhaps comment_type/comment_agent/comment_karma) to temporarily maintain the IDs sent from LJ until threading is completed, then remove/set them back to defaults. Or is that just as hacky? :)
(on the up-side, I just did some quick/non-extensive testing and it appears that var_export escapes things to the extent that if someone put PHP in a comment, it'd be malformed/non-executable when include()d during import -- still testing)
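To make the "spare fields" idea above a bit more concrete, a hypothetical sketch of the insert side (the helper name and the $lj_comment array keys are made up; only the three field choices come from the comment above):
<?php
// Hypothetical sketch: each comment pulled from the LJ API goes straight into
// wp_comments, with LiveJournal's IDs parked in otherwise-unused columns until
// threading is rebuilt. Helper name and $lj_comment keys are illustrative.
function lj_insert_raw_comment( $post_id, $lj_comment ) {
	return wp_insert_comment( array(
		'comment_post_ID'  => $post_id,
		'comment_author'   => $lj_comment['author'],
		'comment_date'     => $lj_comment['date'],
		'comment_content'  => $lj_comment['body'],
		'comment_approved' => 1,
		'comment_type'     => 'livejournal',           // marks comments as mid-import
		'comment_karma'    => $lj_comment['id'],       // LJ's own comment ID
		'comment_agent'    => $lj_comment['parentid'], // LJ ID of the parent comment
	) );
}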
#28
in reply to:
↑ 27
@
16 years ago
Replying to beaulebens:
True. The first time these files are created, they do use wp_upload_bits() (trying to play nice), then they are overwritten in place after that. I agree it's definitely not ideal tho.
(That's for the raw comment files - the var_export() ones don't, because I needed to be able to keep appending to them rather than open/read/add/write.)
#29
in reply to:
↑ 27
;
follow-up:
↓ 31
@
16 years ago
I'm just pondering westi's idea a little more, to see if there's a way the comments could be inserted into wp_comments, using the other fields (perhaps comment_type/comment_agent/comment_karma) to temporarily maintain the IDs sent from LJ until threading is completed, then remove/set them back to defaults. Or is that just as hacky? :)
That seems worth a look. Personally, I'd prefer just about anything to writing out to files.
#30
@
16 years ago
- Resolution fixed deleted
- Status changed from closed to reopened
Re-opening since this is in-progress.
#31
in reply to:
↑ 29
@
16 years ago
Replying to ryan:
I'm just pondering westi's idea a little more, to see if there's a way the comments could be inserted into wp_comments, using the other fields (perhaps comment_type/comment_agent/comment_karma) to temporarily maintain the IDs sent from LJ until threading is completed, then remove/set them back to defaults. Or is that just as hacky? :)
That seems worth a look. Personally, I'd prefer just about anything to writing out to files.
I'm looking into this as an option over the next few days.
#32
follow-up:
↓ 33
@
16 years ago
If we need comment_meta to enable the LJ importer to be scalable then maybe we should add it!
Much better than adding hacky options.
Cooking up a set of comment_meta functions wouldn't take long.
#33
in reply to:
↑ 32
@
16 years ago
Replying to westi:
If we need comment_meta to enable the LJ importer to be scalable then maybe we should add it!
Looks like markjaquith proposed this back in 2006 but it never really gained traction: http://comox.textdrive.com/pipermail/wp-hackers/2006-January/004044.html
Much better than adding hacky options.
Well, the way I'm thinking of doing it would be "temporary", in that the hacks would only exist during the import, and the columns I'd be using aren't populated by the imported LJ data anyway, so there's "no" downside (that I can see?), assuming the importer cleans up after itself properly.
Cooking up a set of comment_meta functions wouldn't take long.
Would be largely copy-paste from post-meta stuff I assume, but I wonder if there's enough call for comment meta to warrant it going into core?
#34
@
16 years ago
We have talked of comment_meta many times.
Ideally we probably want to move to meta_meta so that anything can have meta associated with it.... but that may be a larger project!
#35
follow-up:
↓ 36
@
16 years ago
A note regarding use of files. On wordpress.com and some other MU installations, consecutive page loads aren't guaranteed to hit the same server or even necessarily the same datacenter. Since these files aren't added through the uploader, they may not be replicated to the server[s] handling later parts of the import process.
#36
in reply to:
↑ 35
@
16 years ago
Replying to ryan:
A note regarding use of files. On wordpress.com and some other MU installations, consecutive page loads aren't guaranteed to hit the same server or even necessarily the same datacenter. Since these files aren't added through the uploader, they may not be replicated to the server[s] handling later parts of the import process.
Ryan - I've got the comments being stored/processed from wp_comments, but am still hitting memory limits when processing lots of them in a single loop. I think I'm going to have to break up the loops as I did previously (in the initial version of this) using AJAX. Can I assume that data in wp_comments will be replicated between each request? Or do I need to force a delay in here or something?
#38
follow-up:
↓ 39
@
16 years ago
OK - I've just run a full import (3,750 posts, 200,000 comments) and it took just over 6 HOURS to import (the previous, file-based version took just under 3 hours for the same journal). The comments really kill this process -- re-threading takes 3 hours because querying the comments is super slow since I'm using "spare" fields that aren't indexed.
Importing a smaller journal goes quickly and without problem.
2 options here to speed it up for big journals:
- Before re-threading, ALTER TABLE ADD INDEX on the 3 fields that are used. That seems to reduce things down to literally a couple of minutes, then I can DROP INDEX when I'm done, or
- Use a temporary table in MySQL that's optimized for what I'm trying to do (but then I'd have all sorts of custom code that operated outside the normal comments API).
I'd lean towards the first option, but I don't know how that jibes with the general approach of core code?
Comments? Ryan?
#39
in reply to:
↑ 38
;
follow-up:
↓ 40
@
16 years ago
Replying to beaulebens:
OK - I've just run a full import (3,750 posts, 200,000 comments) and it took just over 6 HOURS to import (the previous, file-based version took just under 3 hours for the same journal). The comments really kill this process -- re-threading takes 3 hours because querying the comments is super slow since I'm using "spare" fields that aren't indexed.
Cool. Although 6 hours is a little long ;-)
Importing a smaller journal goes quickly and without problem.
2 options here to speed it up for big journals:
- Before re-threading, ALTER TABLE ADD INDEX on the 3 fields that are used. That seems to reduce things down to literally a couple of minutes, then I can DROP INDEX when I'm done, or
- Use a temporary table in MySQL that's optimized for what I'm trying to do (but then I'd have all sorts of custom code that operated outside the normal comments API).
I'd lean towards the first option, but I don't know how that jibes with the general approach of core code?
What are the extra fields you are using?
Is there any reason why they couldn't just be indexed anyway?
I think we should go with the speediest solution - people won't want to wait 6 hours for their import to complete!
#40
in reply to:
↑ 39
@
16 years ago
Replying to westi:
Cool. Although 6 hours is a little long ;-)
It sure is :-/
What are the extra fields you are using?
I'm trying to maintain threading information from LiveJournal, so I'm using comment_type='livejournal' to indicate which comments we're dealing with, then comment_karma holds the comment's LiveJournal ID, and comment_agent holds the parentid for threading (again, this is LiveJournal's ID). I loop back over the table WHERE comment_type='livejournal' and update the comment_parent field using a lookup to translate the LJ ID to the new WP ID.
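In other words, the re-threading pass is roughly this (simplified sketch, not the patch itself - batching and error handling omitted):
<?php
// Simplified sketch of the re-threading pass described above: for each imported
// comment that claims a parent, look up the WP comment carrying that parent's
// LiveJournal ID (comment_karma) and point comment_parent at its real WP ID.
global $wpdb;

$children = $wpdb->get_results(
	"SELECT comment_ID, comment_agent FROM {$wpdb->comments}
	 WHERE comment_type = 'livejournal' AND comment_agent != '0'"
);

foreach ( $children as $child ) {
	$parent_id = $wpdb->get_var( $wpdb->prepare(
		"SELECT comment_ID FROM {$wpdb->comments}
		 WHERE comment_type = 'livejournal' AND comment_karma = %d",
		$child->comment_agent
	) );
	if ( $parent_id ) {
		$wpdb->update(
			$wpdb->comments,
			array( 'comment_parent' => $parent_id ),
			array( 'comment_ID' => $child->comment_ID )
		);
	}
}

// Once threading is rebuilt, reset the temporary markers to defaults.
$wpdb->query( "UPDATE {$wpdb->comments} SET comment_type = '', comment_agent = '', comment_karma = 0 WHERE comment_type = 'livejournal'" );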
Is there any reason why they couldn't just be indexed anyway?
It seems extraneous to index these fields under any normal condition (except for perhaps comment_type). More indexes = slower inserts/updates and wp_comments is already pretty heavily indexed.
I think we should go with the speediest solution - people won't want to wait 6 hours for their import to complete!
That's what I was hoping you'd say :) It's literally going to be 3 ADD INDEX commands, then reverting them with 3 DROP INDEXs, so it shouldn't affect anything external. I'll put it in and run a test to confirm numbers.
#41
follow-up:
↓ 42
@
16 years ago
You would need to do an OPTIMIZE TABLE too, to reclaim unused space and defrag data after deleting the indices. And if someone imports into an existing blog that already has a bunch of comments, adding and deleting indices can take a while. Can we add indices only when the import is large and there aren't already a bunch of comments in the DB?
#42
in reply to:
↑ 41
;
follow-up:
↓ 43
@
16 years ago
Replying to ryan:
You would need to do an OPTIMIZE TABLE too, to reclaim unused space and defrag data after deleting the indices. And if someone imports into an existing blog that already has a bunch of comments, adding and deleting indices can take a while. Can we add indices only when the import is large and there aren't already a bunch of comments in the DB?
ATM I'm only adding the indices if there are more than 5,000 comments being imported (a relatively arbitrary number), but good point on the OPTIMIZE once done. From what I've seen, it only takes around 15 seconds (each) to add the 3 indices on the table, even when there are already 200,000 comments in there. Seems like an acceptable wait given the alternative is waiting for hours for your import to complete?
Deleting indices takes around 7 seconds per index.
I'll add an OPTIMIZE once done (in the event that it did actually add the indices).
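Roughly, the index juggling then looks like this (sketch only - the index names and the $total_lj_comments variable are made up; the 5,000 threshold and the OPTIMIZE come from the comments above):
<?php
// Sketch of the temporary index handling: only for large imports, add indices
// before re-threading, then drop them and optimize the table when done.
global $wpdb;

$big_import = ( $total_lj_comments > 5000 ); // threshold discussed above

if ( $big_import ) {
	$wpdb->query( "ALTER TABLE {$wpdb->comments} ADD INDEX lj_type  (comment_type)" );
	$wpdb->query( "ALTER TABLE {$wpdb->comments} ADD INDEX lj_karma (comment_karma)" );
	$wpdb->query( "ALTER TABLE {$wpdb->comments} ADD INDEX lj_agent (comment_agent)" );
}

// ... re-threading runs here ...

if ( $big_import ) {
	$wpdb->query( "ALTER TABLE {$wpdb->comments} DROP INDEX lj_type" );
	$wpdb->query( "ALTER TABLE {$wpdb->comments} DROP INDEX lj_karma" );
	$wpdb->query( "ALTER TABLE {$wpdb->comments} DROP INDEX lj_agent" );
	// Reclaim space / defrag after dropping the indices, per ryan's note.
	$wpdb->query( "OPTIMIZE TABLE {$wpdb->comments}" );
}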
#43
in reply to:
↑ 42
@
16 years ago
ATM I'm only adding the indices if there are more than 5,000 comments being imported (a relatively arbitrary number), but good point on the OPTIMIZE once done. From what I've seen, it only takes around 15 seconds (each) to add the 3 indices on the table, even when there are already 200,000 comments in there. Seems like an acceptable wait given the alternative is waiting for hours for your import to complete?
Cool. Sounds good to me.
#44
@
16 years ago
Attached a new version of this importer. Handles all comments in the wp_comments table using a few "spare" fields during the process, rather than cache files. Optionally adds some indexing as required to speed things up (and removes when done).
Also uses the new cookie support in the HTTP API (#9049) and removes the dependency on Snoopy.
Fixes #9041 as well.
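For testers, the cookie handling via the HTTP API looks roughly like this (illustrative only - the endpoint, query args and cookie name here are examples, not copied from the patch):
<?php
// Illustrative sketch of using the HTTP API's new cookie support (#9049)
// in place of Snoopy. URL, query args and cookie name are examples only.
$session = 'v1:u1234567:...'; // placeholder for the ljsession value obtained earlier

$response = wp_remote_get(
	'http://www.livejournal.com/export_comments.bml?get=comment_body&startid=0',
	array(
		'timeout' => 20,
		'cookies' => array( new WP_Http_Cookie( array( 'name' => 'ljsession', 'value' => $session ) ) ),
	)
);

if ( is_wp_error( $response ) ) {
	// handle the failure (retry, bail out, etc.)
} else {
	$xml = wp_remote_retrieve_body( $response );
}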
My bad - use this diff instead. Had a renaming issue from dev -> diff.