Script always gets a 302 response when pulling random pages from Wikipedia

I can pull any page from Wikipedia with:

    import httplib

    conn = httplib.HTTPConnection("en.wikipedia.org")
    conn.debuglevel = 1
    conn.request("GET", "/wiki/Normal_Distribution", headers={'User-Agent': 'Python httplib'})
    r1 = conn.getresponse()
    r1.read()

The normal response will be

    reply: 'HTTP/1.0 200 OK\r\n'
    header: Date: Sun, 03 Apr 2011 23:49:36 GMT
    header: Server: Apache
    header: Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
    header: Content-Language: en
    header: Vary: Accept-Encoding,Cookie
    header: Last-Modified: Sun, 03 Apr 2011 17:23:50 GMT
    header: Content-Length: 263638
    header: Content-Type: text/html; charset=UTF-8
    header: Age: 1280309
    header: X-Cache: HIT from sq77.wikimedia.org
    header: X-Cache-Lookup: HIT from sq77.wikimedia.org:3128
    header: X-Cache: MISS from sq66.wikimedia.org
    header: X-Cache-Lookup: MISS from sq66.wikimedia.org:80
    header: Connection: close

But if I try to pull a random page with /wiki/Special:Random, I get a 302 response and an empty page:

    reply: 'HTTP/1.0 302 Moved Temporarily\r\n'
    header: Date: Mon, 18 Apr 2011 19:25:52 GMT
    header: Server: Apache
    header: Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
    header: Vary: Accept-Encoding,Cookie
    header: Expires: Thu, 01 Jan 1970 00:00:00 GMT
    header: Location: http://en.wikipedia.org/wiki/Tuticorin_Port_Trust
    header: Content-Length: 0
    header: Content-Type: text/html; charset=utf-8
    header: X-Cache: MISS from sq60.wikimedia.org
    header: X-Cache-Lookup: MISS from sq60.wikimedia.org:3128
    header: X-Cache: MISS from sq62.wikimedia.org
    header: X-Cache-Lookup: MISS from sq62.wikimedia.org:80
    header: Connection: close

How do I get a non-empty random page?

Answer1:

The 302 is a redirect. It's telling you where to go in the following line:

    header: Location: http://en.wikipedia.org/wiki/Tuticorin_Port_Trust

You just need to follow the redirect.
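
Following the redirect by hand could look roughly like the sketch below. It is an illustration rather than a definitive implementation: it assumes Python 3, where httplib was renamed http.client, and the names fetch, open_connection, REDIRECT_STATUSES and the connect parameter are made up for this example.

```python
import http.client  # this module was called httplib in Python 2
from urllib.parse import urljoin, urlsplit

# Status codes treated as "go and request the Location URL instead".
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def open_connection(scheme, host):
    """Hypothetical default factory: pick HTTP or HTTPS from the URL scheme."""
    if scheme == "https":
        return http.client.HTTPSConnection(host, timeout=10)
    return http.client.HTTPConnection(host, timeout=10)


def fetch(url, max_redirects=5, connect=open_connection):
    """GET url, following up to max_redirects redirects.

    Returns a (status, final_url, body) tuple.
    """
    for _ in range(max_redirects + 1):
        parts = urlsplit(url)
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn = connect(parts.scheme, parts.netloc)
        try:
            conn.request("GET", path, headers={"User-Agent": "Python http.client"})
            response = conn.getresponse()
            status = response.status
            location = response.getheader("Location")
            body = response.read()
        finally:
            conn.close()
        if status not in REDIRECT_STATUSES or location is None:
            return status, url, body
        # The Location header may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    raise RuntimeError("gave up after %d redirects" % max_redirects)
```

Calling fetch("http://en.wikipedia.org/wiki/Special:Random") should then return the status, URL and body of whichever article the Location header pointed to, provided the site is reachable. The redirect limit is there so that a redirect loop raises an error instead of running forever.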

Answer2:

When you're redirected, the response object has a status code of 302 and its Location header reports the redirect URL. httplib does not follow redirects for you, so handling them by hand is non-trivial. Do yourself a favor: don't hassle with this stuff, and use the third-party mechanize library, which is a drop-in replacement for urllib2 and follows the redirect automatically.

Using mechanize, your code would look like this:

    import httplib
    import mechanize

    host = 'en.wikipedia.org'
    path = '/wiki/Special:Random'
    url = 'http://' + host + path  # We have to pass a http:// url

    # It still uses httplib.HTTPConnection, so we can debug
    httplib.HTTPConnection.debuglevel = 1

    request = mechanize.Request(url, headers={'User-Agent': 'Python-mechanize'})
    response = mechanize.urlopen(request)
    print response.code      # => 200
    print response.geturl()  # => 'http://en.wikipedia.org/wiki/Faliszowice,_Lesser_Poland_Voivodeship'
    data = response.read()
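
For comparison, if installing mechanize is not an option, the standard library's higher-level URL opener also follows redirects on its own. The sketch below assumes Python 3, where urllib2 became urllib.request; the function name fetch_final_page and the User-Agent string are made up for this example.

```python
import urllib.request  # Python 3 successor to urllib2


def fetch_final_page(url, user_agent="Python-urllib-example"):
    """GET url and let urllib follow any redirects.

    Returns a (status, final_url, body) tuple for the page that was
    finally served, not for the intermediate 302 response.
    """
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=10) as response:
        # response.url is the URL that was actually retrieved after redirects.
        return response.status, response.url, response.read()
```

As with mechanize, the 302 never reaches the caller here, so there is no Location header to inspect by hand.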

Answer3:

HTTP code 302 means you are being redirected. If you look at the Location header, you will see where you should make your new request. Then you can make the request to that URL and you'll hopefully get a 200 on that page.

To clarify: You are being requested to retry the request elsewhere. That's why your client needs to make another request when it receives a 302. Wikipedia's random page apparently works by choosing a random page in its database, then returning a 302 response with the new page as the Location field. If you look at other 302 responses, I'm sure you'll see a different page in the Location field.

Answer4:

Look at the Location header:

header: Location: http://en.wikipedia.org/wiki/Tuticorin_Port_Trust

It says you are being redirected to that page. Read that header and make another request to that URL.
