Make WordPress Core

Opened 14 years ago

Closed 14 years ago

#16236 closed task (blessed) (fixed)

Method to stream file downloads in HTTP API, to reduce update memory footprint

Reported by: markjaquith's profile markjaquith Owned by: sivel's profile sivel
Milestone: 3.2 Priority: normal
Severity: normal Version:
Component: HTTP API Keywords: needs-testing
Focuses: Cc:

Description (last modified by dd32)

Method to stream file downloads in HTTP API, to reduce update memory footprint

Attachments (14)

16236.diff (5.5 KB) - added by sivel 14 years ago.
Some early code to support stream to file using cURL
16236.2.diff (6.2 KB) - added by sivel 14 years ago.
16236.3.diff (7.2 KB) - added by sivel 14 years ago.
Add in streams support
16236.4.diff (9.1 KB) - added by sivel 14 years ago.
Add fsockopen support
16236.5.diff (9.2 KB) - added by sivel 14 years ago.
Performance improvement for fsockopen, to prevent inspecting every block after the headers are finished
16236.6.diff (9.7 KB) - added by sivel 14 years ago.
Spool fsockopen output to a string until we get all of the headers.
16236.7.diff (12.3 KB) - added by sivel 14 years ago.
Docs, cleanup and blacklist ExtHttp transport for streaming to file
16236.8.diff (13.0 KB) - added by sivel 14 years ago.
16236.9.diff (14.5 KB) - added by sivel 14 years ago.
Fixes bug where headers were added to stream file, and update download_url() to use streaming.
16236.10.diff (15.5 KB) - added by sivel 14 years ago.
Add pseudo streaming to file with the HTTP Extension.
16236.11.diff (15.5 KB) - added by sivel 14 years ago.
Adds streaming for HTTP Ext and removes the blacklist from the test()
16236.12.diff (14.9 KB) - added by sivel 14 years ago.
Now that we are using CURLOPT_HEADERFUNCTION, use that for all header processing in cURL
16236.13.diff (15.0 KB) - added by dd32 14 years ago.
16236.14.diff (14.6 KB) - added by sivel 14 years ago.
Remove wp_list_pluck code, and fix body contents for cURL when streaming to a file

Download all attachments as: .zip

Change History (55)

#1 @dd32
14 years ago

  • Component changed from General to HTTP
  • Description modified (diff)
  • Keywords 3.2-early added

fsockopen, fopen: fread()+fwrite() loop, easy

streams: stream_copy_to_stream($http_handle, $new_file_handle);

HTTP Extension: I cant see a case there which allows for streaming directly to files, So will probably have to read into memory, and write it out.

cURL: Can use CURLOPT_FILE to specify where to write the output, CURLOPT_WRITEHEADER or CURLOPT_HEADERFUNCTION should be able to be used to capture the header data

Some of the classes might be able to be unset or destroyed better to decrease memory usage after a request is made.

@sivel
14 years ago

Some early code to support stream to file using cURL

#2 @sivel
14 years ago

Just attached some code to add support for performing this operation from cURL. Initial testing shows a reduced memory footprint, that is not affected by the size of the file we are downloading.

#3 @sivel
14 years ago

This will need some documentation, and we will likely want to discuss how we handle the CURLOPT_HEADERFUNCTION, right now I just did it the easiest way I could come up with.

#4 follow-up: @dd32
14 years ago

There's no need to force it to a GET request when saving the response, It could just as easily be the response from a POST request.

#5 in reply to: ↑ 4 @sivel
14 years ago

Replying to dd32:

There's no need to force it to a GET request when saving the response, It could just as easily be the response from a POST request.

Agreed. I didn't expect this to go in, in its current form. But definitely something i didn't take into consideration. I'll start working on the other transports as time permits and continue cleaning up the code

#6 follow-up: @dd32
14 years ago

No problem :)

A few other thoughts:

  • We should probably let the caller specify the destination folder AND filename, save the class guessing where to put the file.
  • The return could be modified to have headers, body(empty - for backcompat purposes), and (optionally) a filename/path field for when it's been saved
  • The temp directory function could probably be moved to a non-admin spot like that, it'll help plugins expecting to find it on cron tasks/etc.

#7 in reply to: ↑ 6 @sivel
14 years ago

Replying to dd32:

A few other thoughts:

  • We should probably let the caller specify the destination folder AND filename, save the class guessing where to put the file.

I wonder about allowing the user to specify the destination directory. As with the uploads dir we enforce that it must be relative to the wp-content dir. Perhaps if specified we force it to be relative to wp-content. Should we also make it clear that the directory should exist first? Not sure if we want to go about attempting to make the directory or not.

  • The return could be modified to have headers, body(empty - for backcompat purposes), and (optionally) a filename/path field for when it's been saved

Should we introduce a new section to the return? Or just add it to the 'response' section? Currently headers should be populated, and an empty body should be returned.

#8 @dd32
14 years ago

I wonder about allowing the user to specify the destination directory.

I'm thinking of when WordPress is used as a external library in another project, or included on the CLI. In those cases, you don't want to be limited to the WordPress directory.

Given it's a developer function, Passing an absolute path seems appropriate to me.

Should we also make it clear that the directory should exist first?

IMO, it should be assumed the folder is readable (and well, error if we're not able to open the specified file for writing), We do have the option of wp_mkdir_p(), but once again, to decrease complexity, I see no reason not to trust the caller has done their homework.

Should we introduce a new section to the return? Or just add it to the 'response' section? Currently headers should be populated, and an empty body should be returned.

I'm thinking of a response like this:

array
  'headers' => 
    array
      'date' => string 'Sat, 12 Feb 2011 05:47:20 GMT' (length=29)
      'server' => string 'Apache/2.2.14 (Win32) PHP/5.3.2' (length=31)
      <snip>
  'body' => string '' (length=0)
  'response' => 
    array
      'code' => int 200
      'message' => string 'OK' (length=2)
  'cookies' => 
    array
      empty
  'filename' => '/blahblahblah/file.zip'

But if we only accept a absolute filepath then having the filename in the return might be overkill.. but I still feel it'd be nice to include there somewhere. It doesnt fit into the other sections, so I see no reason not to give it it's own element.

@sivel
14 years ago

#9 @sivel
14 years ago

A few improvements based on feedback. Don't force GET, allow for specification of full file path, and return the file path with the rest of the info about the request.

@sivel
14 years ago

Add in streams support

#10 @sivel
14 years ago

New patch adds support for the streams transport.

@sivel
14 years ago

Add fsockopen support

#11 @sivel
14 years ago

New patch adds support for the fsockopen transport.

Fopen is going away in WP 3.2, so no need to do anything there. Need to evaluate how to do this in the HTTP Extension. Perhaps for the ExtHttp transport we just blacklist it for streaming to a file.

@sivel
14 years ago

Performance improvement for fsockopen, to prevent inspecting every block after the headers are finished

#12 @sivel
14 years ago

New patch improves performance over 16236.4.diff. We don't need to continue inspecting the block after headers have been fully sent for the fsockopen transport.

#13 follow-up: @dd32
14 years ago

if ( strpos( $block, "\r\n\r\n" ) ) {

Is there a chance that that could fall over 2 blocks, ie. \r\n in one block, \r\n in the next?

My solution for that, would be

if ( $bodystarted ) {
write to file
} else {
write to headers
check headers for \r\n
 split headers
 write a bit to file
 bodystarted = true;
 NEXT!
}

#14 in reply to: ↑ 13 @sivel
14 years ago

Replying to dd32:

if ( strpos( $block, "\r\n\r\n" ) ) {

Is there a chance that that could fall over 2 blocks, ie. \r\n in one block, \r\n in the next?

You are correct. I guess technically we could only have \r in one block and the rest in the following block...Maybe we just spool up everything into a string until we find \r\n\r\n then bust it up and start writing to the proper places.

#15 @dd32
14 years ago

Maybe we just spool up everything into a string

Pretty much what I said in psuedocode, Except, We can use the header string until the body is located, and then just pull the (small piece of) body out of the header string

@sivel
14 years ago

Spool fsockopen output to a string until we get all of the headers.

#16 @sivel
14 years ago

New patch will spool output to a string until we are sure that we have all of the headers, and then bust it up and start writing to the right places. Added fclose for all $stream_handler's and added some var cleanup.

@sivel
14 years ago

Docs, cleanup and blacklist ExtHttp transport for streaming to file

#17 @sivel
14 years ago

  • Keywords has-patch added

New patch has docs, code cleanup and blacklist for ExtHttp transport for streaming to file. Need to update download_url() and do some testing.

#18 @sivel
14 years ago

  • Owner set to sivel
  • Status changed from new to accepted

#19 @dd32
14 years ago

More patch review:

if ( ! is_dir( dirname( $r['filename'] ) ) || ! is_writable( dirname( $r['filename'] ) ) )

I think the is_dir() should be able to be droped here, Just checking the parent folder of the file to be created is writable should suffice? Only case i can think of would be $filename = FILE . '/somefile.zip'; which would fail without the usage of is_dir..

if ( $r['stream'] )

$stream_handle = fopen( $r['filename'], 'w+' );

$stream_handle needs to be checked that it's a valid resource, ie. that opening the file was possible, is_writable may suceed, but i have a feeling that certain security functionalities can get in the way of that (For example, File already exists and is not writable)

isset( $args['stream'] ) && $args['stream']

!empty($args['stream']) ? false || 0 || '' || !isset() is taken as empty, so should be able to simplify that block down a bit

@sivel
14 years ago

#20 @sivel
14 years ago

New patch covers items mentioned by dd32 above. Also following the method used for fopen elsewhere in the HTTP API, we error suppress the fopen unless WP_DEBUG is enabled.

@sivel
14 years ago

Fixes bug where headers were added to stream file, and update download_url() to use streaming.

#21 @sivel
14 years ago

New patch fixes a bug with cURL, where the headers were added to the stream file and updates download_url() to use the new streaming functionality.

Due to the way that the HTTP API works currently, if the HTTP Extension is installed and used, the download will likely fail, because it was not the first request made, see #8622. This should be updated for the 3.2 release. Fopen will be going away for 3.2, so if the current default for an install is fopen it will fail as well, until we remove the fopen transport, see #13915.

#22 @dd32
14 years ago

This should probably go in after #8622 and #13915 are dealt with. I'm going to look into #8622 now and get it ready for 3.2.

#23 @dd32
14 years ago

sivel: You can try this patch which removes the possibility of a fopen transport, and also fixes it using a incompatible transport, This patch will make it easier to test these patches

#24 @sivel
14 years ago

Would it make sense to add another arg to download_url()? I am thinking in terms of media_sideload_image/wp_handle_sideload. Looks like we download to a temp directory and then later move it over to uploads. Perhaps we could cut out some code and just dump it directly in?

#25 @dd32
14 years ago

we have a few other download-to-file functions such as wp_get_http() I'd like to deprecate all the various wrappers and merge it into the HTTP API functions instead.

ie. Instead of calling download_url(), call wp_remote_get(..stream=path);

@sivel
14 years ago

Add pseudo streaming to file with the HTTP Extension.

@sivel
14 years ago

Adds streaming for HTTP Ext and removes the blacklist from the test()

#26 @sivel
14 years ago

Add pseudo streaming to file with the HTTP Extension. This saves us nothing on the memory allocation when using the HTTP Extension but at least gives us the same functionality that the other transports have. This came from a conversation with Nacin at WordCamp Miami.

If it hasn't been done in another patch, we need to move the HTTP Extension as the last one in priority. No reason to keep it at the front since it is the least tested and now supports less functionality.

@sivel
14 years ago

Now that we are using CURLOPT_HEADERFUNCTION, use that for all header processing in cURL

#27 @sivel
14 years ago

Now that we are using CURLOPT_HEADERFUNCTION, use that for grabbing all of the headers in the curl transport instead of using substr and explode, to separate the headers and body.

@dd32
14 years ago

#28 @dd32
14 years ago

attachment 16236.13.diff added

Patch refresh after the latest class-http.php changes.

The change to using CURLOPT_HEADERFUNCTION was causing the HEAD request unit tests to fail, so I re-worked the curl logic.

sivel, can you test that to make sure it still works as expected?

#29 @dd32
14 years ago

  • Keywords 3.2-early removed
  • Milestone changed from Future Release to 3.2

This is going in once it's been fully tested and benchmarked.

#30 @dd32
14 years ago

Added a unit test to verify the download sizes, seems right, Headers are parsed correctly still too.

Just tested on 2 systems: (Note, this is not a unit test output, a functional benchmark output instead)

 exthttp had an error: There are no HTTP transports available which can complete the requested request.
 curl consumed 7856 Bytes of memory to download 3,096.07kB, took 6 seconds, @ 516.01 K/s
 streams consumed 584 Bytes of memory to download 3,096.07kB, took 7 seconds, @ 442.3 K/s
 fsockopen consumed 608 Bytes of memory to download 3,096.07kB, took 6 seconds, @ 516.01 K/s

 exthttp consumed 14352 Bytes of memory to download 3,096.07kB, took 9 seconds, @ 344.01 K/s
 curl consumed 544 Bytes of memory to download 3,096.07kB, took 90 seconds, @ 34.4 K/s (Note: This slow request couldn't be duplicated, probably some local bottleneck)
 streams had an error: There are no HTTP transports available which can complete the requested request.
 fsockopen consumed 2976 Bytes of memory to download 3,096.07kB, took 6 seconds, @ 516.01 K/s

Those numbers seem off, So i've re-run it with the 'real_memory' param of memory_get_usage() enabled, and all of them aside from HTTP extension returned 0 bytes used:

exthttp consumed 262144 Bytes of memory to download 3,096.07kB, took 8 seconds, @ 387.01 K/s

Overall, we're going to expect the HTTP Extension to use the most memory due to the implementation, however my testing systems don't seem to be able to give me actual memory usage figures..

Will test on another system tomorrow and commit.

#31 @dd32
14 years ago

would've helped if I used the right function (memory_get_peak_usage())

results are similar, <2000 extra bytes used for all transports aside for the HTTP Extension:

exthttp consumed 9,314,240 Bytes of memory to download 3,096.07kB, took 8 seconds, @ 387.01 K/s

That's the difference between the peak memory usage before a wp_remote_request() and after it completed.

#32 @sivel
14 years ago

@dd32, I'll try and test soon. I am not in a place where I can test currently. I am however looking over the refreshed patch and see code changes for wp_list_pluck, which I assume should not have made it into this patch.

#33 @sivel
14 years ago

The results of my testing look good. GET and HEAD requests using cURL seem to populate headers correctly. Here are my results from streaming the wordpress-3.1.zip directly to a file:

Testing curl:

array(5) {
  ["headers"]=>
  array(8) {
    ["server"]=>
    string(5) "nginx"
    ["date"]=>
    string(29) "Thu, 24 Mar 2011 18:43:50 GMT"
    ["content-type"]=>
    string(15) "application/zip"
    ["connection"]=>
    string(5) "close"
    ["pragma"]=>
    string(8) "no-cache"
    ["cache-control"]=>
    string(7) "private"
    ["content-description"]=>
    string(13) "File Transfer"
    ["content-disposition"]=>
    string(38) "attachment; filename=wordpress-3.1.zip"
  }
  ["body"]=>
  bool(true)
  ["response"]=>
  array(2) {
    ["code"]=>
    int(200)
    ["message"]=>
    string(2) "OK"
  }
  ["cookies"]=>
  array(0) {
  }
  ["filename"]=>
  string(34) "/Users/matt/tmp/wordpress/curl.zip"
}
Calculated MD5 (76904d048685a5642fe322e450abb388) matches MD5 (76904d048685a5642fe322e450abb388) of ZIP
Max Memory Usage: 16.821MB

Testing streams:

array(5) {
  ["headers"]=>
  array(8) {
    ["server"]=>
    string(5) "nginx"
    ["date"]=>
    string(29) "Thu, 24 Mar 2011 18:44:04 GMT"
    ["content-type"]=>
    string(15) "application/zip"
    ["connection"]=>
    string(5) "close"
    ["pragma"]=>
    string(8) "no-cache"
    ["cache-control"]=>
    string(7) "private"
    ["content-description"]=>
    string(13) "File Transfer"
    ["content-disposition"]=>
    string(38) "attachment; filename=wordpress-3.1.zip"
  }
  ["body"]=>
  string(0) ""
  ["response"]=>
  array(2) {
    ["code"]=>
    string(3) "200"
    ["message"]=>
    string(2) "OK"
  }
  ["cookies"]=>
  array(0) {
  }
  ["filename"]=>
  string(37) "/Users/matt/tmp/wordpress/streams.zip"
}
Calculated MD5 (76904d048685a5642fe322e450abb388) matches MD5 (76904d048685a5642fe322e450abb388) of ZIP
Max Memory Usage: 16.821MB

Testing fsockopen:

array(5) {
  ["headers"]=>
  array(8) {
    ["server"]=>
    string(5) "nginx"
    ["date"]=>
    string(29) "Thu, 24 Mar 2011 18:44:21 GMT"
    ["content-type"]=>
    string(15) "application/zip"
    ["connection"]=>
    string(5) "close"
    ["pragma"]=>
    string(8) "no-cache"
    ["cache-control"]=>
    string(7) "private"
    ["content-description"]=>
    string(13) "File Transfer"
    ["content-disposition"]=>
    string(38) "attachment; filename=wordpress-3.1.zip"
  }
  ["body"]=>
  string(0) ""
  ["response"]=>
  array(2) {
    ["code"]=>
    string(3) "200"
    ["message"]=>
    string(2) "OK"
  }
  ["cookies"]=>
  array(0) {
  }
  ["filename"]=>
  string(39) "/Users/matt/tmp/wordpress/fsockopen.zip"
}
Calculated MD5 (76904d048685a5642fe322e450abb388) matches MD5 (76904d048685a5642fe322e450abb388) of ZIP
Max Memory Usage: 16.821MB

Testing http_extension:

array(5) {
  ["headers"]=>
  array(9) {
    ["server"]=>
    string(5) "nginx"
    ["date"]=>
    string(29) "Thu, 24 Mar 2011 18:44:40 GMT"
    ["content-type"]=>
    string(15) "application/zip"
    ["transfer-encoding"]=>
    string(7) "chunked"
    ["connection"]=>
    string(5) "close"
    ["pragma"]=>
    string(8) "no-cache"
    ["cache-control"]=>
    string(7) "private"
    ["content-description"]=>
    string(13) "File Transfer"
    ["content-disposition"]=>
    string(38) "attachment; filename=wordpress-3.1.zip"
  }
  ["body"]=>
  string(0) ""
  ["response"]=>
  array(2) {
    ["code"]=>
    int(200)
    ["message"]=>
    string(2) "OK"
  }
  ["cookies"]=>
  array(0) {
  }
  ["filename"]=>
  string(44) "/Users/matt/tmp/wordpress/http_extension.zip"
}
Calculated MD5 (76904d048685a5642fe322e450abb388) matches MD5 (76904d048685a5642fe322e450abb388) of ZIP
Max Memory Usage: 25.63MB

As expected exthttp shows higher memory usage.

As mentioned in a previous comment your latest patch has some extra code changes for wp_list_pluck, that should probably be stripped before commit.

#34 @sivel
14 years ago

Looking back over the results it seems that the cURL transport is setting the body as true instead of ""

@sivel
14 years ago

Remove wp_list_pluck code, and fix body contents for cURL when streaming to a file

#35 @sivel
14 years ago

The updated patch removes the wp_list_pluck code and fixes the body contents for cURL when streaming to a file since curl_exec will return true in such a case over writing the empty body.

#36 @dd32
14 years ago

Looking back over the results it seems that the cURL transport is setting the body as true instead of ""

Good catch!

I was aware of the list_pluck changes, Was intending on striping them out of the patch but forgot!

#37 @dd32
14 years ago

(In [17555]) First run of introducing Stream-To-File for the WP_HTTP API. Reduces memory consumption during file downloads. Implemented in download_url() for upgraders. Props sivel. See #16236

#38 @dd32
14 years ago

(In [17563]) Correct logic for cURL Errors, add extra sanity protection in ::processHeaders to guard against null inputs. See #16236

#39 @dd32
14 years ago

Need to make sure the issues raised in #16057 are dealt with here

#40 @dd32
14 years ago

  • Keywords needs-testing added; has-patch removed

Need to make sure the issues raised in #16057 are dealt with here

Just uploaded a patch on #16057 for this, could do with some eyes

attachment 16057.htttp.diff added

  • Implements checking that the written file size matches the content-length header if set.

#41 @ryan
14 years ago

  • Resolution set to fixed
  • Status changed from accepted to closed
Note: See TracTickets for help on using tickets.