Opened 14 years ago
Closed 14 years ago
#16236 closed task (blessed) (fixed)
Method to stream file downloads in HTTP API, to reduce update memory footprint
Reported by: | markjaquith | Owned by: | sivel |
---|---|---|---|
Milestone: | 3.2 | Priority: | normal |
Severity: | normal | Version: | |
Component: | HTTP API | Keywords: | needs-testing |
Focuses: | Cc: |
Description (last modified by )
Method to stream file downloads in HTTP API, to reduce update memory footprint
Attachments (14)
Change History (55)
#1
@
14 years ago
- Component changed from General to HTTP
- Description modified (diff)
- Keywords 3.2-early added
#2
@
14 years ago
Just attached some code to add support for performing this operation from cURL. Initial testing shows a reduced memory footprint, that is not affected by the size of the file we are downloading.
#3
@
14 years ago
This will need some documentation, and we will likely want to discuss how we handle the CURLOPT_HEADERFUNCTION, right now I just did it the easiest way I could come up with.
#4
follow-up:
↓ 5
@
14 years ago
There's no need to force it to a GET request when saving the response, It could just as easily be the response from a POST request.
#5
in reply to:
↑ 4
@
14 years ago
Replying to dd32:
There's no need to force it to a GET request when saving the response, It could just as easily be the response from a POST request.
Agreed. I didn't expect this to go in, in its current form. But definitely something i didn't take into consideration. I'll start working on the other transports as time permits and continue cleaning up the code
#6
follow-up:
↓ 7
@
14 years ago
No problem :)
A few other thoughts:
- We should probably let the caller specify the destination folder AND filename, save the class guessing where to put the file.
- The return could be modified to have headers, body(empty - for backcompat purposes), and (optionally) a filename/path field for when it's been saved
- The temp directory function could probably be moved to a non-admin spot like that, it'll help plugins expecting to find it on cron tasks/etc.
#7
in reply to:
↑ 6
@
14 years ago
Replying to dd32:
A few other thoughts:
- We should probably let the caller specify the destination folder AND filename, save the class guessing where to put the file.
I wonder about allowing the user to specify the destination directory. As with the uploads dir we enforce that it must be relative to the wp-content dir. Perhaps if specified we force it to be relative to wp-content. Should we also make it clear that the directory should exist first? Not sure if we want to go about attempting to make the directory or not.
- The return could be modified to have headers, body(empty - for backcompat purposes), and (optionally) a filename/path field for when it's been saved
Should we introduce a new section to the return? Or just add it to the 'response' section? Currently headers should be populated, and an empty body should be returned.
#8
@
14 years ago
I wonder about allowing the user to specify the destination directory.
I'm thinking of when WordPress is used as a external library in another project, or included on the CLI. In those cases, you don't want to be limited to the WordPress directory.
Given it's a developer function, Passing an absolute path seems appropriate to me.
Should we also make it clear that the directory should exist first?
IMO, it should be assumed the folder is readable (and well, error if we're not able to open the specified file for writing), We do have the option of wp_mkdir_p(), but once again, to decrease complexity, I see no reason not to trust the caller has done their homework.
Should we introduce a new section to the return? Or just add it to the 'response' section? Currently headers should be populated, and an empty body should be returned.
I'm thinking of a response like this:
array 'headers' => array 'date' => string 'Sat, 12 Feb 2011 05:47:20 GMT' (length=29) 'server' => string 'Apache/2.2.14 (Win32) PHP/5.3.2' (length=31) <snip> 'body' => string '' (length=0) 'response' => array 'code' => int 200 'message' => string 'OK' (length=2) 'cookies' => array empty 'filename' => '/blahblahblah/file.zip'
But if we only accept a absolute filepath then having the filename in the return might be overkill.. but I still feel it'd be nice to include there somewhere. It doesnt fit into the other sections, so I see no reason not to give it it's own element.
#9
@
14 years ago
A few improvements based on feedback. Don't force GET, allow for specification of full file path, and return the file path with the rest of the info about the request.
#11
@
14 years ago
New patch adds support for the fsockopen transport.
Fopen is going away in WP 3.2, so no need to do anything there. Need to evaluate how to do this in the HTTP Extension. Perhaps for the ExtHttp transport we just blacklist it for streaming to a file.
@
14 years ago
Performance improvement for fsockopen, to prevent inspecting every block after the headers are finished
#12
@
14 years ago
New patch improves performance over 16236.4.diff. We don't need to continue inspecting the block after headers have been fully sent for the fsockopen transport.
#13
follow-up:
↓ 14
@
14 years ago
if ( strpos( $block, "\r\n\r\n" ) ) {
Is there a chance that that could fall over 2 blocks, ie. \r\n in one block, \r\n in the next?
My solution for that, would be
if ( $bodystarted ) { write to file } else { write to headers check headers for \r\n split headers write a bit to file bodystarted = true; NEXT! }
#14
in reply to:
↑ 13
@
14 years ago
Replying to dd32:
if ( strpos( $block, "\r\n\r\n" ) ) {
Is there a chance that that could fall over 2 blocks, ie. \r\n in one block, \r\n in the next?
You are correct. I guess technically we could only have \r in one block and the rest in the following block...Maybe we just spool up everything into a string until we find \r\n\r\n then bust it up and start writing to the proper places.
#15
@
14 years ago
Maybe we just spool up everything into a string
Pretty much what I said in psuedocode, Except, We can use the header string until the body is located, and then just pull the (small piece of) body out of the header string
#16
@
14 years ago
New patch will spool output to a string until we are sure that we have all of the headers, and then bust it up and start writing to the right places. Added fclose for all $stream_handler's and added some var cleanup.
#17
@
14 years ago
- Keywords has-patch added
New patch has docs, code cleanup and blacklist for ExtHttp transport for streaming to file. Need to update download_url() and do some testing.
#19
@
14 years ago
More patch review:
if ( ! is_dir( dirname( $r['filename'] ) ) || ! is_writable( dirname( $r['filename'] ) ) )
I think the is_dir() should be able to be droped here, Just checking the parent folder of the file to be created is writable should suffice? Only case i can think of would be $filename = FILE . '/somefile.zip'; which would fail without the usage of is_dir..
if ( $r['stream'] )
$stream_handle = fopen( $r['filename'], 'w+' );
$stream_handle needs to be checked that it's a valid resource, ie. that opening the file was possible, is_writable may suceed, but i have a feeling that certain security functionalities can get in the way of that (For example, File already exists and is not writable)
isset( $args['stream'] ) && $args['stream']
!empty($args['stream'])
?false || 0 || '' || !isset()
is taken as empty, so should be able to simplify that block down a bit
#20
@
14 years ago
New patch covers items mentioned by dd32 above. Also following the method used for fopen elsewhere in the HTTP API, we error suppress the fopen unless WP_DEBUG is enabled.
@
14 years ago
Fixes bug where headers were added to stream file, and update download_url() to use streaming.
#21
@
14 years ago
New patch fixes a bug with cURL, where the headers were added to the stream file and updates download_url() to use the new streaming functionality.
Due to the way that the HTTP API works currently, if the HTTP Extension is installed and used, the download will likely fail, because it was not the first request made, see #8622. This should be updated for the 3.2 release. Fopen will be going away for 3.2, so if the current default for an install is fopen it will fail as well, until we remove the fopen transport, see #13915.
#23
@
14 years ago
sivel: You can try this patch which removes the possibility of a fopen transport, and also fixes it using a incompatible transport, This patch will make it easier to test these patches
#24
@
14 years ago
Would it make sense to add another arg to download_url()? I am thinking in terms of media_sideload_image/wp_handle_sideload. Looks like we download to a temp directory and then later move it over to uploads. Perhaps we could cut out some code and just dump it directly in?
#25
@
14 years ago
we have a few other download-to-file functions such as wp_get_http() I'd like to deprecate all the various wrappers and merge it into the HTTP API functions instead.
ie. Instead of calling download_url(), call wp_remote_get(..stream=path);
#26
@
14 years ago
Add pseudo streaming to file with the HTTP Extension. This saves us nothing on the memory allocation when using the HTTP Extension but at least gives us the same functionality that the other transports have. This came from a conversation with Nacin at WordCamp Miami.
If it hasn't been done in another patch, we need to move the HTTP Extension as the last one in priority. No reason to keep it at the front since it is the least tested and now supports less functionality.
@
14 years ago
Now that we are using CURLOPT_HEADERFUNCTION, use that for all header processing in cURL
#27
@
14 years ago
Now that we are using CURLOPT_HEADERFUNCTION, use that for grabbing all of the headers in the curl transport instead of using substr and explode, to separate the headers and body.
#28
@
14 years ago
attachment 16236.13.diff added
Patch refresh after the latest class-http.php changes.
The change to using CURLOPT_HEADERFUNCTION was causing the HEAD request unit tests to fail, so I re-worked the curl logic.
sivel, can you test that to make sure it still works as expected?
#29
@
14 years ago
- Keywords 3.2-early removed
- Milestone changed from Future Release to 3.2
This is going in once it's been fully tested and benchmarked.
#30
@
14 years ago
Added a unit test to verify the download sizes, seems right, Headers are parsed correctly still too.
Just tested on 2 systems: (Note, this is not a unit test output, a functional benchmark output instead)
exthttp had an error: There are no HTTP transports available which can complete the requested request. curl consumed 7856 Bytes of memory to download 3,096.07kB, took 6 seconds, @ 516.01 K/s streams consumed 584 Bytes of memory to download 3,096.07kB, took 7 seconds, @ 442.3 K/s fsockopen consumed 608 Bytes of memory to download 3,096.07kB, took 6 seconds, @ 516.01 K/s exthttp consumed 14352 Bytes of memory to download 3,096.07kB, took 9 seconds, @ 344.01 K/s curl consumed 544 Bytes of memory to download 3,096.07kB, took 90 seconds, @ 34.4 K/s (Note: This slow request couldn't be duplicated, probably some local bottleneck) streams had an error: There are no HTTP transports available which can complete the requested request. fsockopen consumed 2976 Bytes of memory to download 3,096.07kB, took 6 seconds, @ 516.01 K/s
Those numbers seem off, So i've re-run it with the 'real_memory' param of memory_get_usage() enabled, and all of them aside from HTTP extension returned 0 bytes used:
exthttp consumed 262144 Bytes of memory to download 3,096.07kB, took 8 seconds, @ 387.01 K/s
Overall, we're going to expect the HTTP Extension to use the most memory due to the implementation, however my testing systems don't seem to be able to give me actual memory usage figures..
Will test on another system tomorrow and commit.
#31
@
14 years ago
would've helped if I used the right function (memory_get_peak_usage())
results are similar, <2000 extra bytes used for all transports aside for the HTTP Extension:
exthttp consumed 9,314,240 Bytes of memory to download 3,096.07kB, took 8 seconds, @ 387.01 K/s
That's the difference between the peak memory usage before a wp_remote_request() and after it completed.
#32
@
14 years ago
@dd32, I'll try and test soon. I am not in a place where I can test currently. I am however looking over the refreshed patch and see code changes for wp_list_pluck, which I assume should not have made it into this patch.
#33
@
14 years ago
The results of my testing look good. GET and HEAD requests using cURL seem to populate headers correctly. Here are my results from streaming the wordpress-3.1.zip directly to a file:
Testing curl:
array(5) { ["headers"]=> array(8) { ["server"]=> string(5) "nginx" ["date"]=> string(29) "Thu, 24 Mar 2011 18:43:50 GMT" ["content-type"]=> string(15) "application/zip" ["connection"]=> string(5) "close" ["pragma"]=> string(8) "no-cache" ["cache-control"]=> string(7) "private" ["content-description"]=> string(13) "File Transfer" ["content-disposition"]=> string(38) "attachment; filename=wordpress-3.1.zip" } ["body"]=> bool(true) ["response"]=> array(2) { ["code"]=> int(200) ["message"]=> string(2) "OK" } ["cookies"]=> array(0) { } ["filename"]=> string(34) "/Users/matt/tmp/wordpress/curl.zip" } Calculated MD5 (76904d048685a5642fe322e450abb388) matches MD5 (76904d048685a5642fe322e450abb388) of ZIP Max Memory Usage: 16.821MB
Testing streams:
array(5) { ["headers"]=> array(8) { ["server"]=> string(5) "nginx" ["date"]=> string(29) "Thu, 24 Mar 2011 18:44:04 GMT" ["content-type"]=> string(15) "application/zip" ["connection"]=> string(5) "close" ["pragma"]=> string(8) "no-cache" ["cache-control"]=> string(7) "private" ["content-description"]=> string(13) "File Transfer" ["content-disposition"]=> string(38) "attachment; filename=wordpress-3.1.zip" } ["body"]=> string(0) "" ["response"]=> array(2) { ["code"]=> string(3) "200" ["message"]=> string(2) "OK" } ["cookies"]=> array(0) { } ["filename"]=> string(37) "/Users/matt/tmp/wordpress/streams.zip" } Calculated MD5 (76904d048685a5642fe322e450abb388) matches MD5 (76904d048685a5642fe322e450abb388) of ZIP Max Memory Usage: 16.821MB
Testing fsockopen:
array(5) { ["headers"]=> array(8) { ["server"]=> string(5) "nginx" ["date"]=> string(29) "Thu, 24 Mar 2011 18:44:21 GMT" ["content-type"]=> string(15) "application/zip" ["connection"]=> string(5) "close" ["pragma"]=> string(8) "no-cache" ["cache-control"]=> string(7) "private" ["content-description"]=> string(13) "File Transfer" ["content-disposition"]=> string(38) "attachment; filename=wordpress-3.1.zip" } ["body"]=> string(0) "" ["response"]=> array(2) { ["code"]=> string(3) "200" ["message"]=> string(2) "OK" } ["cookies"]=> array(0) { } ["filename"]=> string(39) "/Users/matt/tmp/wordpress/fsockopen.zip" } Calculated MD5 (76904d048685a5642fe322e450abb388) matches MD5 (76904d048685a5642fe322e450abb388) of ZIP Max Memory Usage: 16.821MB
Testing http_extension:
array(5) { ["headers"]=> array(9) { ["server"]=> string(5) "nginx" ["date"]=> string(29) "Thu, 24 Mar 2011 18:44:40 GMT" ["content-type"]=> string(15) "application/zip" ["transfer-encoding"]=> string(7) "chunked" ["connection"]=> string(5) "close" ["pragma"]=> string(8) "no-cache" ["cache-control"]=> string(7) "private" ["content-description"]=> string(13) "File Transfer" ["content-disposition"]=> string(38) "attachment; filename=wordpress-3.1.zip" } ["body"]=> string(0) "" ["response"]=> array(2) { ["code"]=> int(200) ["message"]=> string(2) "OK" } ["cookies"]=> array(0) { } ["filename"]=> string(44) "/Users/matt/tmp/wordpress/http_extension.zip" } Calculated MD5 (76904d048685a5642fe322e450abb388) matches MD5 (76904d048685a5642fe322e450abb388) of ZIP Max Memory Usage: 25.63MB
As expected exthttp shows higher memory usage.
As mentioned in a previous comment your latest patch has some extra code changes for wp_list_pluck, that should probably be stripped before commit.
#34
@
14 years ago
Looking back over the results it seems that the cURL transport is setting the body as true instead of ""
#35
@
14 years ago
The updated patch removes the wp_list_pluck code and fixes the body contents for cURL when streaming to a file since curl_exec will return true in such a case over writing the empty body.
#36
@
14 years ago
Looking back over the results it seems that the cURL transport is setting the body as true instead of ""
Good catch!
I was aware of the list_pluck changes, Was intending on striping them out of the patch but forgot!
#40
@
14 years ago
- Keywords needs-testing added; has-patch removed
Need to make sure the issues raised in #16057 are dealt with here
Just uploaded a patch on #16057 for this, could do with some eyes
attachment 16057.htttp.diff added
- Implements checking that the written file size matches the content-length header if set.
fsockopen, fopen:
fread()+fwrite()
loop, easystreams:
stream_copy_to_stream($http_handle, $new_file_handle);
HTTP Extension: I cant see a case there which allows for streaming directly to files, So will probably have to read into memory, and write it out.
cURL: Can use
CURLOPT_FILE
to specify where to write the output,CURLOPT_WRITEHEADER
orCURLOPT_HEADERFUNCTION
should be able to be used to capture the header dataSome of the classes might be able to be unset or destroyed better to decrease memory usage after a request is made.