wget with python time limit

Question:

I have a large text file of URLs that I need to download with wget. I have written a small Python script that loops through each entry and downloads it using wget (os.system("wget "+URL)). The problem is that wget hangs on a connection if the remote server doesn't reply after connecting. How do I set a time limit in that case? I want to terminate wget after some time if the remote server stops replying after the connection is made.

Regards,

Answer1:

This seems to be less a question about Python and more a question about how to use wget. In GNU wget, which you are likely using, the default number of tries is 20. You can set the number of tries with -t; for example, wget -t1 gives up after a single failed attempt (-t0 means retry indefinitely). Alternatively, you could use the -S flag to print the server response and have Python react appropriately. The most helpful option for you, though, is -T (--timeout): set it to -T10 to have wget time out after ten seconds and move on.

Edit:

If all you are doing is iterating through a list of URLs and downloading them, I would just use wget; there is no need for Python here. In fact, you can do it in one line:

awk '{print "wget -t2 -T5 --append-output=wget.log \"" $0 "\""}' listOfUrls | bash

This runs through the list of URLs and calls wget for each one. wget tries to download each file at most twice and times out after 5 seconds without a response. It also appends its output to wget.log, which you can grep at the end to look for 404 errors.

Answer2:

Use the --timeout seconds argument to limit the time for a request. You can be more specific and use --connect-timeout seconds if needed. See the <a href="http://www.gnu.org/software/wget/manual/wget.html" rel="nofollow">wget Manual</a> for more information.
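For the original setup, where a Python script calls wget, the timeout options can be passed on the command line and backed up by a hard limit on the Python side. The following is a minimal sketch, assuming Python 3 and a wget binary on the PATH; the function names, the default values, and the hard_limit parameter are illustrative and not part of the question.

```python
import subprocess


def build_wget_command(url, timeout=10, tries=1):
    """Build the wget argument list with a network timeout and a retry limit."""
    return ["wget", "--timeout=%d" % timeout, "--tries=%d" % tries, url]


def download(url, timeout=10, tries=1, hard_limit=60):
    """Run wget for one URL; return True on success, False otherwise.

    hard_limit (hypothetical name) is an overall wall-clock limit in seconds
    enforced by Python, in case wget itself does not give up in time.
    """
    try:
        result = subprocess.run(
            build_wget_command(url, timeout, tries),
            timeout=hard_limit,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising.
        return False
    except FileNotFoundError:
        # wget is not installed or not on the PATH.
        return False
    return result.returncode == 0
```

Passing the arguments as a list avoids the shell quoting problems that os.system("wget "+URL) can run into when a URL contains characters such as & or ?.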

Answer3:

You don't need to use external tools such as wget. Use the built-in urllib2 module to download files. The documentation is available <a href="http://docs.python.org/library/urllib2.html" rel="nofollow">here</a>.
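As a rough illustration of the standard-library route, here is a minimal sketch. It assumes Python 3, where the urllib2 functionality lives in urllib.request and urllib.error; the fetch helper and its default timeout are hypothetical names chosen for this example.

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout=10.0):
    """Return the response body as bytes, or None if the request fails or times out."""
    try:
        # The timeout (in seconds) applies to blocking socket operations,
        # including waiting for a server that connected but never replies.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, socket.timeout) as exc:
        print("Request for %s failed: %s" % (url, exc))
        return None
```

A script could then call fetch(url) for each line of the URL file and skip any entry that returns None.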

Answer4:

You shouldn't be calling the wget binary from Python for a task like this. Use one of the available <strong>HTTP libraries</strong> for Python instead; you'll get much better error handling and control.

There's urllib2 (<a href="http://docs.python.org/library/urllib2.html" rel="nofollow">official docs</a>, <a href="http://www.voidspace.org.uk/python/articles/urllib2.shtml" rel="nofollow">Missing Manual</a>) which is part of the standard library.

However, I'd strongly recommend using the excellent <a href="http://docs.python-requests.org" rel="nofollow"><strong>requests</strong> module</a> instead. It has a very clean API and makes simple tasks simple, as they should be, but still offers a lot of flexibility and fine-grained control.

Using the requests module, you can <a href="http://docs.python-requests.org/en/latest/user/quickstart/#timeouts" rel="nofollow">specify the timeout</a> (in seconds) by using the timeout keyword argument like so:

response = requests.get(url, timeout=0.02)

If the timeout is exceeded, a Timeout exception is raised, which you'll need to catch and handle in whatever way you like.

import requests
from requests.exceptions import Timeout, ConnectionError

TIMEOUT = 0.02  # seconds

urls = ['http://www.stackoverflow.com', 'http://www.google.com']

for url in urls:
    try:
        response = requests.get(url, timeout=TIMEOUT)
        print("Got response %s" % response.status_code)
        response_body = response.content
    except (ConnectionError, Timeout) as e:
        print("Request for %s failed: %s" % (url, e))
        # Handle however you need to ...

Sample output:

Request for http://www.stackoverflow.com failed: Request timed out.
Request for http://www.google.com failed: Request timed out.
