WordPress.org

Make WordPress Core

Opened 4 years ago

Closed 22 months ago

#14525 closed defect (bug) (fixed)

Blogger importer prepends ">" to all content

Reported by: mdawaffe Owned by: Otto42
Milestone: WordPress.org Priority: normal
Severity: major Version:
Component: Import Keywords: has-patch dev-feedback
Focuses: Cc:

Description

Import a Blogger blog.

All post_title, post_content, and comment_text (and maybe others) fields will have an extraneous ">" character prepended to their contents.

Attached is a possible fix.

Attachments (2)

14525.diff (1.2 KB) - added by mdawaffe 4 years ago.
14525.combined.patch (96.9 KB) - added by SergeyBiryukov 2 years ago.

Download all attachments as: .zip

Change History (38)

mdawaffe4 years ago

comment:1 mdawaffe4 years ago

The patch works for me, but that the importer gets to that point in the logic with a non string element makes me think something else is broken.

comment:2 Otto423 years ago

  • Milestone WordPress.org site deleted

comment:3 Otto423 years ago

  • Milestone set to WordPress.org site

comment:4 designsimply3 years ago

Also see #17776 which suggests the problem only happens for PHP 5.3+.

comment:5 designsimply3 years ago

Tested 14525.diff with WordPress 3.2.1 and PHP 5.3.2. Works for me: the extra ">" characters aren't added to the beginning of post titles and content.

comment:6 nacin3 years ago

  • Owner set to duck_
  • Status changed from new to assigned

comment:7 nacin3 years ago

  • Owner changed from duck_ to Otto42

comment:8 Otto423 years ago

Branch version fixes this and other bugs:
http://plugins.svn.wordpress.org/blogger-importer/branches/oauth/

Needs code review and/or testing. Changes include:

  • plugin now uses OAuth for Google Authentication
  • SimplePie library used to parse feed data

If can get some testing or code review on it, I'll merge it back into the main plugin. My testing worked, but is necessarily limited in scope. I can't test for things I didn't think of.

comment:9 dllh3 years ago

  • Cc daryl@… added

I can confirm that mdawaffe's fix works for the older importer. I also tested the new branch and everything seemed to work properly. My testing was very limited as well.

comment:10 designsimply3 years ago

I tested http://plugins.svn.wordpress.org/blogger-importer/branches/oauth/ with WordPress r18444 using test Blogger blog http://design5279.wordpress.com/ which contains 4 posts, 1 page, and 12 comments:

  • Had to click the import/Continue button more than once.
  • All 4 posts were imported on the first attempt.
  • No pages were imported.
  • Only 1 of 12 comments were imported.
  • "The Magic Button" in the Tools -> Import -> Blogger page now says "Set Authors" instead of "Import" and it appears to be stuck.

I tried the "Clear account information" button again, then went to Tools -> Import -> Blogger and clicked Import again to see if it would get the rest of the comments. Each time I did that process, one new comment was imported. Looks like only one is coming through at a time for some reason.

comment:11 designsimply3 years ago

  • Keywords dev-feedback added

comment:12 yorksranter3 years ago

I tested this branch against my Blogger account (6 blogs, biggest 2,724 posts, 1,943 comments). I have been trying to import this blog for 9 months (see #15290).

The good news - the OAuth handshake works, although Google thinks the application is on my local machine and advises me not to trust it (illogically). I assume that's the "Google thing" mentioned in the source.

The better news - unlike previous versions (see #15290), it successfully acquires the Blogger metafeed and reads out metadata for all the blogs. And it even begins to import content.

The bad news: It imports exactly 25 posts and 1 comment. Then wants a kick ("Continue"). This doesn't cause any more import, though. It then goes to "Set Authors". I've not yet been able to get it to venture beyond the first 25 posts yet, despite turning up MAX_RESULTS and MAX_EXECUTION_TIME.

comment:13 kraftbj3 years ago

  • Cc kraftbj added

Tested this patch against a personal account and executed a migration for a client. 997 posts, ~1600 comments.

Executed perfectly. No extra >, no errors, all content transitioned.

comment:14 follow-up: designsimply3 years ago

kraftbj, did you test the 14525.diff patch or the OAuth branch at http://plugins.svn.wordpress.org/blogger-importer/branches/oauth/ ?

comment:15 in reply to: ↑ 14 yorksranter3 years ago

Replying to designsimply:

kraftbj, did you test the 14525.diff patch or the OAuth branch at http://plugins.svn.wordpress.org/blogger-importer/branches/oauth/ ?

I'd very much like to know that as well!

comment:16 yorksranter2 years ago

I have been waiting for my blog to get imported for 18 months.

comment:17 Workshopshed2 years ago

Tried this with the 3.4 version, the > are removed but also no comments are imported

comment:18 Workshopshed2 years ago

-

Last edited 2 years ago by Workshopshed (previous) (diff)

comment:19 Workshopshed2 years ago

Hi all,
I've taken Otto's version and bashed out a load of the bugs as well as making a couple of enhancements. Given the significant differences it won't really work as a diff so here's the whole thing as a zip.

http://alt.workshopshed.com/blogger-importer20120517.zip

Let me know of any issues, hopefully we can get this merged back into the main stream at some point.

The biggest blog I've tested this on is just 410 posts and 269 comments so would be interested to hear if you've got a bigger one.

cheers,

Andy

Version 1, edited 2 years ago by Workshopshed (previous) (next) (diff)

comment:20 Workshopshed2 years ago

Another day, another version....

I found a bunch of other Trac issues and changes recommended by others so I've merged those in where possible.

Details of changes:

0.6 (todays change)

Andy from Workshopshed

0.5

  • Change by Otto42, rmccue to use Simplepie XML processing rather than Atomparser, http://core.trac.wordpress.org/ticket/14525 ref: http://core.trac.wordpress.org/attachment/ticket/7652/7652-blogger.diff this also fixes http://core.trac.wordpress.org/ticket/15560
  • Change by Otto42 to use OAuth rather than AuthSub authentication, should make authentication more reliable
  • Fix by Andy from Workshopshed to load comments correctly
  • Fix by Andy from Workshopshed to correctly pass the blogger start-index and max-results parameters to oAuth functions and to process more than one batch http://core.trac.wordpress.org/ticket/19096
  • Fix by Andy from Workshopshed error about incorrect enqueuing of scripts also changed styles to work the same
  • Change by Andy from Workshopshed testing in debug mode and wrapped ajax return into a function to suppress debug messages
  • Fix by Andy from Workshopshed notices for undefined variables.
  • Change by Andy from Workshopshed Added tooltip to results table to show numbers of posts and comments skipped (duplicates / missing key)
  • Fix by Andy from Workshopshed incorrectly checking for duplicates based on only the date and username, this gave false positives when large numbers of comments, particularly anonymous ones.

Sorry that it has to be in zip format. I don't have access to the blogger-import SVN (a branch would be a good idea if anyone has rights to give me permissions), I also tried to do a diff but tortoise does not support generating a diff of multiple files.

http://alt.workshopshed.com/blogger-importer20120518.zip

comment:21 SergeyBiryukov2 years ago

Attached a patch with Workshopshed's changes. Had to convert EOL markers to Unix format in order for TortoiseSVN to handle the changes properly.

comment:22 Workshopshed2 years ago

Thanks Sergey

comment:23 nacin2 years ago

@Workshopshed: I am happy to give you commit to the Blogger Importer plugin as long as releases are cleared by one of the core developers (such as myself, westi, or ryan). You can find us during the week in IRC at #wordpress-dev on freenode.

comment:24 Workshopshed2 years ago

Thanks, if I can put this version in a branch then you guys could move it across when you were happy with it. Would that work for you?

Given the numbers of changes I'd definately suggest few other pairs of eyes have a look before it's put in trunk.

You can find me during the week in a shed at the bottom of the garden

Last edited 2 years ago by Workshopshed (previous) (diff)

comment:25 yorksranter2 years ago

Just tested the branch on a blog with 2,862 posts and 2,400 comments. Works absolutely fine. I've closed #15290 as fixed, and I'm willing to give this one a hearty ship-it.

Thank you very much for giving this component the dose of concentrated development it needed.

comment:26 Otto422 years ago

I'll be taking a look at the changes for this next week, and will commit a new version after evaluating it.

comment:27 rmccue2 years ago

  • Cc me@… added

#7652 can probably be closed now that all my patches are merged in.

Indentation for WP_SimplePie_Blog_Item appears to be incorrect too. I'm not sure why get_updated and get_published didn't work from the original patch, but there should be a SimplePie_Misc method to replace convert_date. If I find time, I'll try and track that down.

Also, Simplepie_Sanitize should be SimplePie_Sanitize, and there's a couple of things that should be capital P'd (dangit).

comment:28 Workshopshed23 months ago

Thanks Ryan, I'll go through the code with a find and replace swaping over the pies.

I saw another thread about the issues with unit testing things like importers. This issue with this particular importer (I've not looked at the others) is that there is tight coupling between the UI code, blog reading, processing and storage. For example...

function show_blogs($iter = 0)
        {
            if (empty($this->blogs))
            {
                $xml = $this->oauth_get('https://www.blogger.com/feeds/default/blogs');

could instead be more like

function show_blogs($iter = 0)
        {
            if (empty($this->blogs))
            {
                $this->blogs = $this->importer->getbloglist();

We could then set "importer" to be a mock object that does not need to talk to blogger at all. And on the flip side, we could test the importing without the need for the UI.

The changes to incorporate SimplePie have moved us towards this approach but it would need a lot of effort to make it so that it could be easily unit tested. Unfortunately that's a little more of a commitment to the importer than I'd like to make. I do plan to incorporate image migration (as none of the other tools do that very well) and something for processing of internal links. I've already put in place some processing for the GeoTags but that needs a little work as I spotted that there are some get_latitute etc methods to use.

Last edited 23 months ago by Workshopshed (previous) (diff)

comment:29 follow-up: Workshopshed23 months ago

I'm looking at adding images to this importer, does anyone have an opinion on this? Is it within the remit of what the plugin should be doing or should it be separate?

I've been looking at a couple of other plugins that do similar things, but they don't seem to quite to everything I'd expect.

The images on blogger are typically in the format of:

<a href="hihrezimage.jpg"><img src="lowrezimage.jpg"></a>

The images structure of the URLs contains a "size" indicator as one of the "folders" e.g. s144 for a lowresolution file and s800 for a large one. There is a note in the blogger image importer that some of these point at a page rather than an image

On my blogs which have images back 4 or 5 years now the images are located on xxx.googleusercontent.com, xxx.ggpht.com, xxx.bp.blogspot.com with additionally some of the higher resolution images coming from picasaweb.google.com

Can I confirm that the correct process to handle the images would be something like the following?

  • Get the location of the upload folder (wp_upload_dir)
  • Check image is not already downloaded
  • Confirm that the medium resolution (img src) url points to an image (wp_check_filetype)
  • Download the medium resolution image to uploads folder (download_url)
  • Read meta data for the image
  • Confirm that the high resolution (link href) url points to an image
  • Download the high resolution image to uploads folder
  • Read meta data for the image
  • Add the image details to the database as an attachement
  • Add attachment meta data
  • Generate thumbnails (wp_generate_attachment_metadata)
  • Change link URL to point to the high resolution image
  • Change image URL to point to the low resolution image
  • Link the attachment to the post

I'm thinking that a separate class that takes a of URLs and handles the downloading and creation of the attachment passing the details back to the caller. It should theoretically be possible to use that in any importer?

I'm thinking that the changes would be along the lines of

  • New class that handles the image processes as described above
  • WP_SimplePie_BlogItem would be extended to include a get_images function returning a collection of images, possibly in pairs of image and link?
  • Blogger_Importer->import_blog change to pass these images to the new class to process them
  • Blogger_Importer->import_blog change to store the results in BloggerEntry
  • Blogger_Importer->import_blog change to do a find and replace on URLs in the content
  • Blogger_Importer->import_post changed to connect the attachement details to the post

Perhaps we could also keep a count of the attachements downloaded?

Last edited 23 months ago by Workshopshed (previous) (diff)

comment:30 Workshopshed23 months ago

I've looked at the replace_urls in the SimplePie sanitize class and can't work out what it's purpose is. I was expecting it would do something like changing the domain from it's old location to the new domain?

comment:31 in reply to: ↑ 29 SergeyBiryukov23 months ago

Replying to Workshopshed:

Can I confirm that the correct process to handle the images would be something like the following?

Perhaps media_sideload_image() could be used there?

comment:32 follow-up: Otto4223 months ago

media_sideload_image is indeed the correct way, and it's the method we use in the Tumblr importer. It's a bit finicky though, so if you want to implement that, watch your error handling carefully. Sometimes you just can't get the image and have to leave it where it is.

comment:33 in reply to: ↑ 32 Workshopshed23 months ago

Replying to Otto42:

media_sideload_image is indeed the correct way

Cheers Otto, Sergey, I can see that media_sideload_image does do several of the things needed but it also seems a bit restrictive, I can't work out how I'd use that get both the medium and high resolution images as the same attachment and returning the html seems a bit odd. However it should be possible to use media_handle_sideload to do the bulk of the job so thanks for pointing me in the right direction.

comment:34 Otto4223 months ago

You don't get both the medium and high resolution images. You just get the highest resolution image, as WordPress itself will then create the lower sized versions.

comment:35 Workshopshed23 months ago

Sounds good, will do some experiments.

comment:36 Otto4222 months ago

  • Resolution set to fixed
  • Status changed from assigned to closed

All the given changes have been pushed into the blogger importer version 0.5. This issue in particular (prepending >'s) should be fixed. New tickets for new issues, plz.

Note: See TracTickets for help on using tickets.