
Opened 18 years ago

Closed 11 years ago

Last modified 11 years ago

#4010 closed enhancement (fixed)

Add Image Importing to the Blogger Importer

Reported by: clwill
Owned by: Workshopshed
Milestone: WordPress.org
Priority: normal
Severity: normal
Version:
Component: Import
Keywords:
Focuses:
Cc:

Description

The new Blogger importer currently does a great job of bringing the blog over to WP, but it leaves the images associated with the blog on blogger.com and/or blogspot.com. This violates Blogger's TOS and risks having the user's image links blocked by Blogger. This change will move those images (using the WP image upload facility, of course) to the user's blog and fix the links as the import is done.

Attachments (4)

blogger-importer.2.zip (31.1 KB) - added by Workshopshed 12 years ago.
blogger-importer.zip (93.1 KB) - added by Workshopshed 11 years ago.
Moved google connection into separate class to help with testing of this plugin
blogger-importer.3.zip (94.5 KB) - added by Workshopshed 11 years ago.
Added missing file
blogger-importer.4.zip (93.4 KB) - added by Workshopshed 11 years ago.


Change History (65)

#1 @foolswisdom
17 years ago

  • Milestone changed from 2.3 to 2.5 (future)

#2 @bryan868
16 years ago

+1

I believe the importer also strips out all embedded content (video, audio, etc). That shouldn't happen!

#3 @Otto42
16 years ago

Extra +1. The WordPress importer can do this when importing from WordPress.com, the Blogger importer should do the same thing.

#4 @DD32
16 years ago

  • Component changed from Administration to Import
  • Keywords needs-patch added; import removed

#5 @Denis-de-Bernardy
15 years ago

  • Keywords changed from blogger images needs-patch to needs-patch blogger images
  • Milestone changed from 2.9 to Future Release

#6 @SergeyBiryukov
12 years ago

  • Milestone changed from Future Release to WordPress.org

#7 @Workshopshed
12 years ago

  • Cc workshopshed@… added

#8 @Workshopshed
12 years ago

I've added a new version of the importer that loads images. It's a lot slower than the previous version; the performance issue is where you'd expect it to be, in the downloading and processing of the files.
Any comments or feedback would be appreciated.

Known issue: this version does not properly handle source files with special characters in their names, for example Test%2BFile.jpg

Known issue: it does not handle images whose link points to an HTML page, e.g. S800-H in the URL; the function is in place but not yet written

Last edited 12 years ago by Workshopshed (previous) (diff)

#9 @wordpresssites
12 years ago

So it does import images? The read me text says it doesn't. Please clarify.

Is there an importer which imports image files (Not just the file urls) from WordPress.com to WordPress.org?

#10 @Workshopshed
12 years ago

The 0.5 version does not import images. The 0.6 version attached to this ticket does import images, although it is a beta and has known issues.

What it does is to scan your post content for images with HTML tags of the form

<a href="xxx"><img src="yyy"></a> or just <img src="yyy">

then downloads the images from those locations. These are uploaded to your site's host in the same way that a manually attached image would be processed.
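The scan described above could be sketched roughly as follows. This is only an illustration, not the importer's actual code: the function name sketch_find_images and the regular expression are assumptions, and the pattern only handles quoted attribute values.

```php
<?php
/**
 * Collect image references from post HTML.
 * Hypothetical helper for illustration; not taken from the plugin.
 * Returns a list of ['href' => string|null, 'src' => string] pairs.
 */
function sketch_find_images(string $html): array
{
    // An optional wrapping <a href="..."> followed by an <img ... src="...">.
    // Other attributes may appear before href or src.
    $pattern = '~(?:<a\b[^>]*?\shref=["\'](?P<href>[^"\']*)["\'][^>]*>\s*)?'
             . '<img\b[^>]*?\ssrc=["\'](?P<src>[^"\']*)["\'][^>]*>~i';

    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $found = [];
    foreach ($matches as $m) {
        $found[] = [
            // An image without a wrapping anchor yields an empty href group.
            'href' => ($m['href'] ?? '') !== '' ? $m['href'] : null,
            'src'  => $m['src'],
        ];
    }
    return $found;
}
```

Each pair could then be handed to whatever routine downloads the file and rewrites the post content.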

Good point: the readme for the beta version needs to be updated.

You should raise your question about importing from WordPress.com against the WordPress Importer; this ticket is specifically about the Blogger Importer.

#11 @Workshopshed
12 years ago

Latest beta: fixed a couple of issues. Firstly, images with URLs of the form /s320-h/, which link to an HTML page rather than the image. Also filenames with things like %2B in them, which seem to cause a problem with WordPress (or perhaps just my hosting).

Also a setting to turn off the image importing as it can be a bit slow.

Known issues: Does not set the author of the image attachments, does not get the highest resolution image possible, only the size included in the img or link tags.

Last edited 12 years ago by Workshopshed (previous) (diff)

#12 @clwill
12 years ago

Can some appropriately privileged person please make Workshopshed the owner of this ticket?

Thanks,
Chris

#13 @SergeyBiryukov
12 years ago

  • Keywords blogger images removed
  • Owner changed from clwill to Workshopshed
  • Priority changed from low to normal
  • Status changed from new to assigned

#14 @Workshopshed
12 years ago

Thanks Clwill, Sergey. The issue that Bryan mentioned above had re-surfaced with the latest version of WordPress; that's fixed in the latest attachment.

#15 @Workshopshed
12 years ago

Sorry, accidentally added a second attachment "blogger-importer.2.zip". I've no idea how to remove that.

#16 @Workshopshed
12 years ago

The latest beta now migrates images across from blogger and when you set the authors it applies that to the images too.
I've also adjusted it to get a larger image when there is no anchor link.
There's been a bit of restructuring too, so if anyone is in the mood to do some regression testing on this, please do.
There's some work-in-progress stuff in there too to migrate the links. It's not complete and also not the topic of this issue, but it should not get in the way of testing.

#17 @Workshopshed
12 years ago

JavaScript has been moved to an external file and rendering of the table is now done by the WP class WP_List_Table, so the importer now looks more like other admin screens. The progress bars are now done using JQuery.UI.Progressbar and we get a count of the images processed. Post revisions are turned off during the loading process. The importer can be re-run without getting duplicates. I have added a refresh button so you can see if you've had more updates since you last imported.

Still getting false positives on files that are not actually images.

My false positive test is a Wikipedia page which, although it has a .png extension, is actually an HTML page and hence we don't want to download it.

<a href="http://en.wikipedia.org/wiki/File:Centrifugal_governor.png"><img src="suitablelocalimage.png"></a>
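One way to catch this kind of false positive is to look at the first bytes of the downloaded data instead of trusting the extension. The sketch below is an assumption about how such a check might look, not the check the importer uses; the function name is hypothetical and only PNG, JPEG and GIF signatures are covered.

```php
<?php
/**
 * Return true when the data starts with a known image signature.
 * Hypothetical helper for illustration; covers PNG, JPEG and GIF only.
 */
function sketch_looks_like_image(string $bytes): bool
{
    $signatures = [
        "\x89PNG\r\n\x1a\n", // PNG
        "\xFF\xD8\xFF",      // JPEG
        "GIF87a",            // GIF, 1987 format
        "GIF89a",            // GIF, 1989 format
    ];

    foreach ($signatures as $signature) {
        if (strncmp($bytes, $signature, strlen($signature)) === 0) {
            return true;
        }
    }
    return false;
}
```

An HTML page served under a .png URL starts with markup rather than one of these signatures, so it would be rejected.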
Last edited 12 years ago by Workshopshed (previous) (diff)

#18 @Workshopshed
12 years ago

False positive issue should be fixed.

#19 @Workshopshed
12 years ago

This version also processes internal links and swaps them out with your new wordpress style permalinks. The links and images status bars also get updated.

Still got a slight issue with those pesky Wikipedia links in that it's swapping the href when it should not.

Also want to move the image processing into its own loop (like the link processing is done) so we can see a status bar on that.

There's also a little bit of Javascript that needs reformatting and moving over from the show_blogs to the _js_vars function on the new WP_List_Table object

Then it's pretty much done. Let me know of any issues you find with this latest one.

Cheers,

Andy

Last edited 12 years ago by Workshopshed (previous) (diff)

#20 @Workshopshed
12 years ago

Image processing is in its own loop, so we get a progress bar on that now.

Partial fix to the pesky Wikipedia link, in that it works the first time the image is used, but if you repeatedly have links of the form:

<a href="http://www.oldsite.com/linktohtmlpage.jpg">
     <img src="http://www.oldsite.com/linktoimage.jpg">
</a>

Then the first link will be correctly handled with the anchor tag not being changed as it's not a valid image. But any subsequent ones will change the anchor to point at the large version of the image already downloaded. It would be possible to fix this by always downloading the large image but as that's the slowest step it's good to avoid that.

_js_vars function is outstanding, that needs updating, as per previous comment.

New issue (which I think has been there all along): the status bars don't seem to update reliably. The jQuery makes an Ajax call to the status function as it has always done, but that is not returning the "current" values. The result is that the progress bars are jerky, maybe updating every few seconds rather than once per call. The data displayed is the data returned; it's just that the data is out of date, perhaps something to do with caching of the options?

The javascript has been "I18n"ed so you can make "%d of %d" in your own language. The pot file has not yet been updated.

#21 @Workshopshed
12 years ago

Latest version sorts out the status issue and some other issues with the Javascript to do with the timeouts, previously it was continually polling rather than waiting 3s as it was supposed to.

The readme is updated.

The _js_vars function is fixed so all the Javascript is now properly handled.

Those "pesky" wikipedia links are fixed as is the counting of duplicated images.

As suspected, the issue with the updates was to do with update_option caching values. I'm not entirely happy with the solution I came up with, in that it's a double update; one side effect is that the status can sometimes be read incorrectly, meaning that the bars flick to 0 then back again. If anyone has ideas, the issue is documented over at #13480

Have implemented a timeout so people with problematic

The things on my todo list are:

Double check scheduled posts are properly handled

Double check larger blogs, that all posts and all comments are loaded (as people have reported issues with earlier versions). My largest blog is currently 461 posts and 392 comments; I've heard of people trying to migrate significantly larger blogs than that.

And also importantly, get it in SVN.

Please, if anyone can test this or review the code, I'd love to have some feedback.

Last edited 12 years ago by SergeyBiryukov (previous) (diff)

#22 @Workshopshed
12 years ago

Double checked all statuses handled correctly, published, draft, scheduled working fine.
Tried a few things but no luck with the update_option issue.

Known issues with this version:

If an image has other attributes before the SRC attribute then it is not processed; this should just need a tweak to the regex.

e.g. <img border=0 alt="new test" src="myimage.jpg">

If post revisions are turned on then problems occur when you re-import, e.g. comments are re-imported and linked to the drafts?

The setting to turn off the post revisions is not working.

I think the high-resolution images should default to a smaller size: in my case 900 images come to 300MB at the 1600 size, so I am thinking that 1024 would be a better default.

#23 @Workshopshed
12 years ago

First issue reported on last upload has proved not to be the case. My FTP server was out of space hence no new images were getting uploaded. It would be good to have better error reporting in that situation.

Fixed post revisions issue by using

remove_post_type_support('post', 'revisions')

Added new constant for default image download size and have defaulted that to 1024 as suggested.
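As an illustration of a default download size, the sketch below rewrites the size segment of an image URL. It assumes the size appears as a path segment such as /s320/ or /s320-h/, as in the URLs mentioned earlier in this ticket; the constant and function names are hypothetical, not the ones used in the plugin.

```php
<?php
// Hypothetical constant name; 1024 is the default size discussed above.
const SKETCH_LARGE_IMAGE_SIZE = 1024;

/**
 * Replace the first size segment (/s320/, /s320-h/, ...) with the wanted size.
 * URLs without such a segment are returned unchanged.
 */
function sketch_resize_image_url(string $url, int $size = SKETCH_LARGE_IMAGE_SIZE): string
{
    $resized = preg_replace('~/s\d+(?:-h)?/~i', '/s' . $size . '/', $url, 1);

    // preg_replace returns null on failure; fall back to the input.
    return $resized ?? $url;
}
```

Whether the rewritten URL actually serves an image of that size depends on the host, so the result would still need to be checked after download.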

Have added a screenshot to the release.

Known issues:

Flickering status due to double save of options

Still no testing on massive blogs, e.g. 32,000 posts, 80,000 comments people have mentioned on the support pages.

Last edited 12 years ago by Workshopshed (previous) (diff)

#24 @Workshopshed
12 years ago

Fixed a couple of issues with notices being displayed and a typo spotted by Jared Henderson

#25 @Workshopshed
12 years ago

Fixed issue with enqueuing of JQuery.UI.Progressbar and added error trap into javascript, thanks to Benjamin Tennant from ProPhoto blogs for help with testing.

#26 @Workshopshed
12 years ago

Latest upload, fix for issue with ob_clear not working in development mode in the ajax_die function.

I think the real solution to this is to switch over to doing ajax calls as per http://codex.wordpress.org/AJAX_in_Plugins#Ajax_on_the_Administration_Side but I'm not sure that totally fixes the issue with other plugins sending debug info before the ajax returns its data.

#27 @rfehosting
12 years ago

I am not sure if this is an issue with the plugin itself, but while im importing it looks like this when it is updating status - http://screencast.com/t/q6grneVvjr
Then about 5-10 secs later it actually shows its current status - http://screencast.com/t/vyB7UvH2
Here is a short video of what it looks like:
http://screencast.com/t/J7rWbb9gi1

I use the general blogger importer plugin quite a lot, and would love to assist in testing it to make sure it is working.
I import large and small blogs weekly.

So if you need a contact for testing, let me know.

I am currently importing a blog with 30k comments. Seems after 5k every importer poops out, and takes FOREVER. Any ideas on how to speed that up?

Let me know what you think.

Thanks

Aaron

---
Edit: Ok i see you posted in a reply above about the status bar flickering, is that the same issue i see or something else?

Also on your note about large blogs, I do those often with 20k-50k comments. So if you need test subjects let me know. I'd love to see a working plugin that works for all blog sizes. I am so tired of having to import to wordpress.com then export and import to self hosted. And for large blogs you can't use the appspot site. So this would be the best option if it can be fixed to work for huge blogs like that.

Last edited 12 years ago by rfehosting (previous) (diff)

#28 @Workshopshed
12 years ago

Aaron, yes the "flickering" I mentioned is the same issue you've reported. I've added some notes to a separate issue which I think is related to the problem with large blogs, it could also be the answer to the status bar issue. http://core.trac.wordpress.org/ticket/6369#comment:8

On a technical point, what is happening is that the pressing of the button starts up 2 processes. The first runs the import and the second periodically reads the status. The status is being passed between these two in the form of writing and reading a wordpress option. I was finding that it was not being updated into the DB so I update with a dummy value and then back to the actual value, this has the side effect you are seeing. I believe this is related to a core issue #13480 but it may also be due to the size of data being stored which is the issue 6369 mentioned above.

Does it eventually finish? How long does take to get to 1000 posts and 6000 comments?

Last edited 12 years ago by SergeyBiryukov (previous) (diff)

#29 @rfehosting
12 years ago

OK so in those screenshots i was at like 9k Comments, Now i am at 10k and have to press continue. So no it is very slow, at this rate it will take me prob the rest of the day.

I have never ever been able to get this many comments to import using the plugin. I have always had to go to wp.com then over to self hosted.

I have actually tried to setup the google app engine, and use their tool, but it fails to export categories so you end up getting posts/comments just fine but they are all assigned to "uncategorized" So the end result for large blogs is bad.

So the import plugins seem to only get to 5800 comments, then after that they take years to get the rest. Or it just times out and gets stuck and you have to restart it.

#30 @Workshopshed
12 years ago

Aaron, thanks for your feedback, it does sound like a memory limit. I'll look into those changes suggested but it could take some time.

#31 @rfehosting
12 years ago

Now would this be a server memory limit or a limit in the plugin?
Right now i have the servers php settings:
memory_limit 256M
upload_max_filesize 64M

Do you recommend higher, or other values for PHP for larger blogs?

#32 @Workshopshed
12 years ago

I'm thinking that it's running out of memory so it would be that 256MB setting. If you could make that either bigger or smaller and see how many comments it processes before stopping then that would tell us if the hypothesis is correct.

#33 @rfehosting
12 years ago

So quick update.
I set PHP's mem to 1gig of ram, and it still seems to timeout and just stop with no continue button. So refreshing the blog list makes the import re-appear then i can continue it.

So i am not sure it is a ram issue.

Also, I cant even get wp to even use that much ram. I'm watching the servers resources and it wont go over like 250 meg, even though its set for 1gig. So wonder if there is a hard/max limit set in WP somewhere?
--

And to add to it, it seems that after 5k or so comments it just has issues period, maybe there is a hard limit somewhere that can be adjusted to make it import smoother, and bring in the rest of the comments?

Last edited 12 years ago by rfehosting (previous) (diff)

#34 @Workshopshed
12 years ago

Fix for issue where images are suffixed with ? or # parameters e.g. ?imgmax=800
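A minimal sketch of handling such suffixes, assuming the goal is a clean URL from which to derive the file name. The function name is hypothetical and this is not necessarily how the importer does it.

```php
<?php
/**
 * Drop everything from the first "?" or "#" onwards.
 * Hypothetical helper for illustration.
 */
function sketch_strip_url_suffix(string $url): string
{
    // strcspn gives the length of the leading part that has no "?" or "#".
    return substr($url, 0, strcspn($url, '?#'));
}
```

For example, basename(sketch_strip_url_suffix($url)) would then give a file name without the ?imgmax=800 part.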

#36 @rfehosting
12 years ago

  • Status changed from assigned to closed

Sounds good, ill give it a go, and report back.

#37 @rfehosting
12 years ago

  • Status changed from closed to reopened

Wups, hmm shouldn't allow me access to close this ticket.

Ok so now comments wont import at all with your new update. It just skips onto images and links.

#38 @Workshopshed
12 years ago

I did think you were being a bit hasty closing the ticket.

Did you delete all the previously imported posts first and clear down the trash?

Normally you'd not need to do that but a change was made that means the posts need to be re-imported so that it can record the blogger internal ID for those in the Meta data. That will then be used so that when the comments are imported the importer knows which post to link them to.

Thanks again for your help testing, sorry this change is a bit of a pain.

Last edited 12 years ago by Workshopshed (previous) (diff)

#39 @rfehosting
12 years ago

WOW you are amazing. It pulled them all in with one full swoop. Didn't have to press continue at all. Never had a blog this large get imported this quick.

So i ran it on my test server, and it worked flawlessly.
I wonder why no one ever thought of that in the past.

So looks like comments and posts are working well. And it did links also, so will be awesome once images are working good too.

Let me know if you need any further testing.

Thanks
Aaron

#40 @Workshopshed
12 years ago

Thanks, we are getting there. I think the images might be working mostly ok, I've had a few images that seem to need more than one pass to import.

The status/progress display is not looking so good for this version; for example, I imported 700 images but the progress said that only 1 had been done.

#41 @rfehosting
12 years ago

Ahh yes, I got that too. It said it had 66 images, but pulled in far more than that, but still not all of them it should have.

#42 @Workshopshed
12 years ago

If you could provide the <a href=""><img src=""></a> details of one of the files that did not get processed I'll see if there was something specific about it that caused it not to process.

#43 @rfehosting
12 years ago

Still working on that test for you.

One question, is it possible to disable the link updates? I am not 100% sure what that does, but it seems to break the site's permalinks. I use a plugin called "Maintain Blogger Permalinks", and this shortens the permalinks for all the posts to match how Blogger had them. But it seems when the link update runs in your plugin it changes the metadata so the plugin will not work. Or, can you adjust the plugin to work in the same manner? What does the links update do?

This plugin i use is not avail anymore that I know of, so i can send it over if needed so you can find out what it does.

For Example, the blogger permalink is:
/2013/04/chevron-easter-dress-and-1-year.html
Yet it gets pulled into wordpress as:
/2013/04/chevron-easter-dress-and-1-year-blogiversary.html

So what the plugin does, is makes it match, as they somehow know how blogger would build the link, so after importing, i run that plugin and it shortens it to:
/2013/04/chevron-easter-dress-and-1-year.html

Thanks

Last edited 12 years ago by rfehosting (previous) (diff)

#44 @Workshopshed
12 years ago

In the last version I made a change to the way the meta data is stored.

What the link processing does is to look for links that match the "blogger_permalink" value and update it with the new permalink of the post that Wordpress has assigned. Any links that match that of the old site but don't match a permalink get mapped across with the new site name e.g.

mysite.blogspot.com/p/about.html -> mynewsite.com/p/about.html
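The two rules above could be sketched like this. It is an illustration under assumptions: the map of old to new permalinks is passed in as a plain array keyed by full URL, and the function and parameter names are hypothetical rather than taken from the plugin.

```php
<?php
/**
 * Rewrite one link found in imported content.
 * $permalinkMap maps old post URLs to new permalinks (assumed shape).
 */
function sketch_rewrite_link(string $url, array $permalinkMap, string $oldHost, string $newHost): string
{
    // Rule 1: a link to an imported post gets that post's new permalink.
    if (isset($permalinkMap[$url])) {
        return $permalinkMap[$url];
    }

    // Rule 2: any other link on the old site keeps its path on the new host.
    $parts = parse_url($url);
    if (is_array($parts) && isset($parts['host']) && strcasecmp($parts['host'], $oldHost) === 0) {
        $scheme   = $parts['scheme'] ?? 'http';
        $path     = $parts['path'] ?? '/';
        $query    = isset($parts['query']) ? '?' . $parts['query'] : '';
        $fragment = isset($parts['fragment']) ? '#' . $parts['fragment'] : '';
        return $scheme . '://' . $newHost . $path . $query . $fragment;
    }

    // Links to other sites are left alone.
    return $url;
}
```

Keeping the old permalink in the map untouched is what lets a redirection setup, as described in the next paragraph, keep working.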

What I do for my sites is to use that meta data and import it into the "redirection" plugin so that anyone who's linked an old URL will get directed across to the new location, otherwise they might get a 404 error. My scripts also rely on the correct data being stored in "blogger_permalink". I also handle the special cases such as the pages (as above) and searches with the redirection plugin.

Here's some notes on that: http://fleacircusdir.livejournal.com/4913.html

So I've corrected the importer to store the permalinks as it did before.

To answer your other question, to disable the processing of links find a couple of lines around 380 in the function import_blog in the file blogger-import.php and comment them out.

            if (!$this->process_links()) 
                self::ajax_die('continue');
Version 2, edited 12 years ago by Workshopshed (previous) (next) (diff)

#45 @rfehosting
12 years ago

Thanks for that. Ya looks like there is no need if the blogger_permalink is still there. So thanks for that.

So far no results on the image imports; it keeps timing out on me, so I'm just working on tweaking the timeout settings to find the sweet spot so I can at least get the continue button to show.

Thanks for all your help. Ill let you know once i get it imported.

#46 @Workshopshed
12 years ago

The timeout at the top of the file blogger-import.php is for google's data APIs only, not the downloading of the images.

If you are getting timeouts on the images then you need to look down at line 1196 and 1208 in the function import_image.

That's where you will find "download_url", the default timeout is 300s and can be overwritten by adding a separate parameter.

http://codex.wordpress.org/Function_Reference/download_url
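A rough sketch of passing a longer timeout to download_url. The wrapper name and the 600 second value are assumptions for illustration; download_url and is_wp_error are WordPress functions, so the guard makes the wrapper return null when run outside WordPress.

```php
<?php
/**
 * Download a file with a custom timeout.
 * Hypothetical wrapper; 600 is an arbitrary illustrative value.
 * Returns the temporary file path, or null on failure.
 */
function sketch_download_image(string $url, int $timeoutSeconds = 600): ?string
{
    // download_url is only defined inside the WordPress admin environment.
    if (!function_exists('download_url')) {
        return null;
    }

    // The second argument overrides the default timeout of 300 seconds.
    $tmpFile = download_url($url, $timeoutSeconds);

    return is_wp_error($tmpFile) ? null : $tmpFile;
}
```

Raising the timeout only helps with slow downloads; it does not change PHP's own execution time limit.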

#47 @rfehosting
12 years ago

Great thanks for that.

#48 @Workshopshed
11 years ago

Ok, so 7 months since I last had a fully operational version, that's why I did not upload anything.

I've restructured the code so that the "blogs" are a separate class. This allows me to save these separately as options and switch over to use proper ajax techniques.

http://codex.wordpress.org/AJAX_in_Plugins#Ajax_on_the_Administration_Side

The good news is that this has fixed the problem where the status was not refreshing correctly, particularly when processing the images.

If you look at the code at the moment it's storing the token and secret against each of the blogs which is not good. There is one last restructure I need to do which is to create a class that handles all of the connections to google in one place.

#49 @Workshopshed
11 years ago

Got a small issue to resolve where posts in the trash cause the image processing to loop indefinitely.

Last edited 11 years ago by Workshopshed (previous) (diff)

@Workshopshed
11 years ago

Moved google connection into separate class to help with testing of this plugin

#50 @Workshopshed
11 years ago

Restructure the code to remove the issue from my last upload where the secret and token from the authentication were being passed all over. And to provide a single location (except for image download) that provides all of the data to the plugin. It should be possible to mock this now for unit testing.

#51 @Znuff
11 years ago

Your latest version doesn't seem to be working for me, looks like it's missing a file:

Warning: require_once(/home/wwwfumar/public_html/wp-content/plugins/blogger-importer/blogger-entry.php): failed to open stream: No such file or directory in /home/wwwfumar/public_html/wp-content/plugins/blogger-importer/blogger-importer.php on line 29
Last edited 11 years ago by Znuff (previous) (diff)

@Workshopshed
11 years ago

Added missing file

#52 @Workshopshed
11 years ago

Sorry Znuff, see blogger-importer.3.zip

#53 @Workshopshed
11 years ago

Fixed issue with OpenSSL and displaying of error messages.

#54 @Workshopshed
11 years ago

I don't have SVN access for the Blogger Importer and I don't really want to completely fork this into a "new" plugin so that's why I've kept attaching the code as a zip file.

However, rather than keep uploading zip files I've put it all on github.

https://github.com/Workshopshed/BloggerImporter

#55 @Otto42
11 years ago

I can give you SVN access to update the plugin directly.

Any objections to this?

#56 @nacin
11 years ago

None. Workshopshed, please just run any new releases by Otto42 or a core dev.

#57 @Workshopshed
11 years ago

Cheers, feel free to throw any code reviewers in my direction.

Would be nice to get a release out as long as it does not contain any show stoppers.

#58 @Otto42
11 years ago

Workshopshed: You have access now. The blogger importer uses the Tags system, so upload your latest dev code to trunk, but don't change the Stable Tag in the readme.txt and then it won't get released until it's ready.

Having the latest beta code in trunk allows plugins like https://wordpress.org/plugins/plugin-beta-tester/ to work.

#60 @Workshopshed
11 years ago

  • Resolution set to worksforme
  • Status changed from reopened to closed

The 0.7 version with the images is now out there in the wild.

On my site the plugin beta tester does not seem to do what it's expected to do.

#61 @SergeyBiryukov
11 years ago

  • Keywords needs-patch removed
  • Resolution changed from worksforme to fixed