Opened 14 years ago
Closed 12 years ago
#14525 closed defect (bug) (fixed)
Blogger importer prepends ">" to all content
Reported by: | mdawaffe | Owned by: | Otto42 |
---|---|---|---|
Milestone: | WordPress.org | Priority: | normal |
Severity: | major | Version: | |
Component: | Import | Keywords: | has-patch dev-feedback |
Focuses: | Cc: |
Description
Import a Blogger blog.
All post_title, post_content, and comment_text (and maybe others) fields will have an extraneous ">" character prepended to their contents.
Attached is a possible fix.
Attachments (2)
Change History (38)
#5
@
13 years ago
Tested 14525.diff with WordPress 3.2.1 and PHP 5.3.2. Works for me: the extra ">" characters aren't added to the beginning of post titles and content.
#8
@
13 years ago
Branch version fixes this and other bugs:
http://plugins.svn.wordpress.org/blogger-importer/branches/oauth/
Needs code review and/or testing. Changes include:
- plugin now uses OAuth for Google Authentication
- SimplePie library used to parse feed data
If can get some testing or code review on it, I'll merge it back into the main plugin. My testing worked, but is necessarily limited in scope. I can't test for things I didn't think of.
#9
@
13 years ago
- Cc daryl@… added
I can confirm that mdawaffe's fix works for the older importer. I also tested the new branch and everything seemed to work properly. My testing was very limited as well.
#10
@
13 years ago
I tested http://plugins.svn.wordpress.org/blogger-importer/branches/oauth/ with WordPress r18444 using test Blogger blog http://design5279.wordpress.com/ which contains 4 posts, 1 page, and 12 comments:
- Had to click the import/Continue button more than once.
- All 4 posts were imported on the first attempt.
- No pages were imported.
- Only 1 of 12 comments were imported.
- "The Magic Button" in the Tools -> Import -> Blogger page now says "Set Authors" instead of "Import" and it appears to be stuck.
I tried the "Clear account information" button again, then went to Tools -> Import -> Blogger and clicked Import again to see if it would get the rest of the comments. Each time I did that process, one new comment was imported. Looks like only one is coming through at a time for some reason.
#12
@
13 years ago
I tested this branch against my Blogger account (6 blogs, biggest 2,724 posts, 1,943 comments). I have been trying to import this blog for 9 months (see #15290).
The good news - the OAuth handshake works, although Google thinks the application is on my local machine and advises me not to trust it (illogically). I assume that's the "Google thing" mentioned in the source.
The better news - unlike previous versions (see #15290), it successfully acquires the Blogger metafeed and reads out metadata for all the blogs. And it even begins to import content.
The bad news: It imports exactly 25 posts and 1 comment. Then wants a kick ("Continue"). This doesn't cause any more import, though. It then goes to "Set Authors". I've not yet been able to get it to venture beyond the first 25 posts yet, despite turning up MAX_RESULTS and MAX_EXECUTION_TIME.
#13
@
13 years ago
- Cc kraftbj added
Tested this patch against a personal account and executed a migration for a client. 997 posts, ~1600 comments.
Executed perfectly. No extra >, no errors, all content transitioned.
#14
follow-up:
↓ 15
@
13 years ago
kraftbj, did you test the 14525.diff patch or the OAuth branch at http://plugins.svn.wordpress.org/blogger-importer/branches/oauth/ ?
#15
in reply to:
↑ 14
@
13 years ago
Replying to designsimply:
kraftbj, did you test the 14525.diff patch or the OAuth branch at http://plugins.svn.wordpress.org/blogger-importer/branches/oauth/ ?
I'd very much like to know that as well!
#17
@
12 years ago
Tried this with the 3.4 version, the > are removed but also no comments are imported
#19
@
12 years ago
Hi all,
I've taken Otto's version and bashed out a load of the bugs as well as making a couple of enhancements and fleshing out the readme.txt. Given the significant differences it won't really work as a diff so here's the whole thing as a zip.
http://alt.workshopshed.com/blogger-importer20120517.zip
Let me know of any issues, hopefully we can get this merged back into the main stream at some point.
The biggest blog I've tested this on is just 410 posts and 269 comments so would be interested to hear if you've got a bigger one.
I've been developping against Wordpress 3.4.
cheers,
Andy
#20
@
12 years ago
Another day, another version....
I found a bunch of other Trac issues and changes recommended by others so I've merged those in where possible.
Details of changes:
0.6 (todays change)
Andy from Workshopshed
- Merged in fix by SergeyBiryukov http://core.trac.wordpress.org/ticket/16012
- Merged in rmccue change to get_total_results to also use SimplePie from http://core.trac.wordpress.org/attachment/ticket/7652/7652-blogger.diff
- Reviewed in rmccue's changes in http://core.trac.wordpress.org/attachment/ticket/7652/7652-separate.diff issues with date handling functions so skipped those
- Moved SimplePie functions in new class WP_SimplePie_Blog_Item incorporating get_draft_status and get_updated and convert date
- Tested comments from source blog GMT-8, destination London (currently GMT-1), comment dates transferred correctly.
- Fixed typo in oauth_get
0.5
- Change by Otto42, rmccue to use Simplepie XML processing rather than Atomparser, http://core.trac.wordpress.org/ticket/14525 ref: http://core.trac.wordpress.org/attachment/ticket/7652/7652-blogger.diff this also fixes http://core.trac.wordpress.org/ticket/15560
- Change by Otto42 to use OAuth rather than AuthSub authentication, should make authentication more reliable
- Fix by Andy from Workshopshed to load comments correctly
- Fix by Andy from Workshopshed to correctly pass the blogger start-index and max-results parameters to oAuth functions and to process more than one batch http://core.trac.wordpress.org/ticket/19096
- Fix by Andy from Workshopshed error about incorrect enqueuing of scripts also changed styles to work the same
- Change by Andy from Workshopshed testing in debug mode and wrapped ajax return into a function to suppress debug messages
- Fix by Andy from Workshopshed notices for undefined variables.
- Change by Andy from Workshopshed Added tooltip to results table to show numbers of posts and comments skipped (duplicates / missing key)
- Fix by Andy from Workshopshed incorrectly checking for duplicates based on only the date and username, this gave false positives when large numbers of comments, particularly anonymous ones.
Sorry that it has to be in zip format. I don't have access to the blogger-import SVN (a branch would be a good idea if anyone has rights to give me permissions), I also tried to do a diff but tortoise does not support generating a diff of multiple files.
#21
@
12 years ago
Attached a patch with Workshopshed's changes. Had to convert EOL markers to Unix format in order for TortoiseSVN to handle the changes properly.
#23
@
12 years ago
@Workshopshed: I am happy to give you commit to the Blogger Importer plugin as long as releases are cleared by one of the core developers (such as myself, westi, or ryan). You can find us during the week in IRC at #wordpress-dev on freenode.
#24
@
12 years ago
Thanks, if I can put this version in a branch then you guys could move it across when you were happy with it. Would that work for you?
Given the numbers of changes I'd definately suggest few other pairs of eyes have a look before it's put in trunk.
You can find me during the week in a shed at the bottom of the garden
#25
@
12 years ago
Just tested the branch on a blog with 2,862 posts and 2,400 comments. Works absolutely fine. I've closed #15290 as fixed, and I'm willing to give this one a hearty ship-it.
Thank you very much for giving this component the dose of concentrated development it needed.
#26
@
12 years ago
I'll be taking a look at the changes for this next week, and will commit a new version after evaluating it.
#27
@
12 years ago
- Cc me@… added
#7652 can probably be closed now that all my patches are merged in.
Indentation for WP_SimplePie_Blog_Item
appears to be incorrect too. I'm not sure why get_updated
and get_published
didn't work from the original patch, but there should be a SimplePie_Misc
method to replace convert_date
. If I find time, I'll try and track that down.
Also, Simplepie_Sanitize
should be SimplePie_Sanitize
, and there's a couple of things that should be capital P'd (dangit).
#28
@
12 years ago
Thanks Ryan, I'll go through the code with a find and replace swaping over the pies.
I saw another thread about the issues with unit testing things like importers. This issue with this particular importer (I've not looked at the others) is that there is tight coupling between the UI code, blog reading, processing and storage. For example...
function show_blogs($iter = 0)
{
if (empty($this->blogs))
{
$xml = $this->oauth_get('https://www.blogger.com/feeds/default/blogs');
could instead be more like
function show_blogs($iter = 0)
{
if (empty($this->blogs))
{
$this->blogs = $this->importer->getbloglist();
We could then set "importer" to be a mock object that does not need to talk to blogger at all. And on the flip side, we could test the importing without the need for the UI.
The changes to incorporate SimplePie have moved us slighty more towards this approach but it would need a lot of effort to make it so that it could be easily unit tested. Unfortunately that's a little more of a commitment to the importer than I'd like to make. I do plan to incorporate image migration (as none of the other tools do that very well) and something for processing of internal links. I've already put in place some processing for the GeoTags but that needs a little work as I spotted that there are some get_latitute etc methods to use.
#29
follow-up:
↓ 31
@
12 years ago
I'm looking at adding images to this importer, does anyone have an opinion on this? Is it within the remit of what the plugin should be doing or should it be separate?
I've been looking at a couple of other plugins that do similar things, but they don't seem to quite to everything I'd expect.
- http://wordpress.org/extend/plugins/remote-images-grabber
- http://notions.okuda.ca/wordpress-plugins/blogger-image-import/
The images on blogger are typically in the format of:
<a href="hihrezimage.jpg"><img src="lowrezimage.jpg"></a>
The images structure of the URLs contains a "size" indicator as one of the "folders" e.g. s144 for a lowresolution file and s800 for a large one. There is a note in the blogger image importer that some of these point at a page rather than an image
- https://lh4.googleusercontent.com/-nt66qhxzDyY/TZOD-RhTYMI/AAAAAAAACd4/Elzm1smRFb4/s144/Ski%2520Trip.jpg (small image)
- https://lh4.googleusercontent.com/-nt66qhxzDyY/TZOD-RhTYMI/AAAAAAAACd4/Elzm1smRFb4/s800/Ski%2520Trip.jpg (large image)
- https://lh4.googleusercontent.com/-nt66qhxzDyY/TZOD-RhTYMI/AAAAAAAACd4/Elzm1smRFb4/s800-h/Ski%2520Trip.jpg (page)
On my blogs which have images back 4 or 5 years now the images are located on xxx.googleusercontent.com, xxx.ggpht.com, xxx.bp.blogspot.com with additionally some of the higher resolution images coming from picasaweb.google.com
Can I confirm that the correct process to handle the images would be something like the following?
- Get the location of the upload folder (wp_upload_dir)
- Check image is not already downloaded
- Confirm that the medium resolution (img src) url points to an image (wp_check_filetype)
- Download the medium resolution image to uploads folder (download_url)
- Read meta data for the image
- Confirm that the high resolution (link href) url points to an image
- Download the high resolution image to uploads folder
- Read meta data for the image
- Add the image details to the database as an attachement
- Add attachment meta data
- Generate thumbnails (wp_generate_attachment_metadata)
- Change link URL to point to the high resolution image
- Change image URL to point to the low resolution image
- Link the attachment to the post
I'm thinking that a separate class that takes a of URLs and handles the downloading and creation of the attachment passing the details back to the caller. It should theoretically be possible to use that in any importer?
I'm thinking that the changes would be along the lines of
- New class that handles the image processes as described above
- WP_SimplePie_BlogItem would be extended to include a get_images function returning a collection of images, possibly in pairs of image and link?
- Blogger_Importer->import_blog change to pass these images to the new class to process them
- Blogger_Importer->import_blog change to store the results in BloggerEntry
- Blogger_Importer->import_blog change to do a find and replace on URLs in the content
- Blogger_Importer->import_post changed to connect the attachement details to the post
Perhaps we could also keep a count of the attachements downloaded?
#30
@
12 years ago
I've looked at the replace_urls in the SimplePie sanitize class and can't work out what it's purpose is. I was expecting it would do something like changing the domain from it's old location to the new domain?
#31
in reply to:
↑ 29
@
12 years ago
Replying to Workshopshed:
Can I confirm that the correct process to handle the images would be something like the following?
Perhaps media_sideload_image()
could be used there?
#32
follow-up:
↓ 33
@
12 years ago
media_sideload_image is indeed the correct way, and it's the method we use in the Tumblr importer. It's a bit finicky though, so if you want to implement that, watch your error handling carefully. Sometimes you just can't get the image and have to leave it where it is.
#33
in reply to:
↑ 32
@
12 years ago
Replying to Otto42:
media_sideload_image is indeed the correct way
Cheers Otto, Sergey, I can see that media_sideload_image does do several of the things needed but it also seems a bit restrictive, I can't work out how I'd use that get both the medium and high resolution images as the same attachment and returning the html seems a bit odd. However it should be possible to use media_handle_sideload to do the bulk of the job so thanks for pointing me in the right direction.
- http://theme.fm/2011/10/how-to-upload-media-via-url-programmatically-in-wordpress-2657/
- http://wordpress.stackexchange.com/questions/30284/media-sideload-image-file-name
- http://core.trac.wordpress.org/ticket/19629
- http://core.trac.wordpress.org/ticket/16777
- http://core.trac.wordpress.org/ticket/16330
The patch works for me, but that the importer gets to that point in the logic with a non string element makes me think something else is broken.