Python to Save Web Pages

Question:

This is probably a very simple task, but I cannot find any help. I have a website whose URLs take the form www.xyz.com/somestuff/ID, and I have a list of the IDs I need information from. I was hoping for a simple script that goes to the site and downloads the (complete) web page for each ID, saving it in a specific folder under a simple name such as ID_whatever_the_default_save_name_is.

Can I run a simple Python script to do this for me? I could do it by hand, since it is only 75 pages, but I was hoping to use this to learn how to do things like this in the future.

Answer1:

<a href="https://pypi.python.org/pypi/mechanize/" rel="nofollow">Mechanize</a> is a great package for crawling the web with Python. A simple example for your issue would be:

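    import mechanize

    br = mechanize.Browser()
    # The URL needs an explicit scheme (http:// or https://).
    response = br.open("http://www.xyz.com/somestuff/ID")
    print(response.read())  # raw HTML returned by the server

This simply grabs your URL and prints the response from the server.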

Answer2:

This can be done simply in Python using the urllib module. Here is a simple example in Python 3:

    import urllib.request

    # urlopen needs a full URL, including the scheme.
    url = 'http://www.xyz.com/somestuff/ID'
    req = urllib.request.Request(url)
    page = urllib.request.urlopen(req)
    src = page.read()  # read() returns the page body as bytes
    print(src)

For more information on the urllib module, see <a href="http://docs.python.org/3.3/library/urllib.html" rel="nofollow">http://docs.python.org/3.3/library/urllib.html</a>
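
Note that reading the response gives you bytes rather than text. If you want a regular string, for example to search it or print it cleanly, you have to decode it. Here is a minimal sketch of one way to do that. The function name `fetch_text` and the UTF-8 fallback are my own assumptions, not anything the site requires:

```python
import urllib.request


def fetch_text(url, fallback_encoding="utf-8"):
    """Return the body at `url` decoded to a str (hypothetical helper)."""
    with urllib.request.urlopen(url) as page:
        raw = page.read()  # the body as bytes
        # Charset declared in the Content-Type header, or None if absent.
        charset = page.headers.get_content_charset()
    # Assumption: fall back to UTF-8 when the server declares no charset.
    return raw.decode(charset or fallback_encoding, errors="replace")
```

With `errors="replace"`, bytes that cannot be decoded become replacement characters instead of raising an exception, which is convenient for a quick look but may hide an encoding mismatch.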

Answer3:

Do you want just the HTML code for the website? If so, create a URL variable with the host site and append the page number as you go. Here is an example with <a href="http://www.notalwaysright.com" rel="nofollow">http://www.notalwaysright.com</a>:

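    import os
    import urllib.request

    url = "http://www.notalwaysright.com/page/"
    os.makedirs("Page", exist_ok=True)  # make sure the output folder exists

    for x in range(1, 71):
        newurl = url + str(x)  # the page number must be converted to a string
        response = urllib.request.urlopen(newurl)
        # read() returns bytes, so open the file in binary mode
        with open("Page/" + str(x), "wb") as p:
            p.write(response.read())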
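
The same idea can be adapted to the list of IDs in the question. This is only a rough sketch under a few assumptions: the function name `save_pages`, the `.html` suffix, and skipping pages that fail to download are my choices, and it saves only the HTML document itself, not the images, stylesheets, or scripts that a browser's "save complete page" would include.

```python
import pathlib
import urllib.request


def save_pages(base_url, ids, out_dir):
    """Download base_url + ID for each ID and save it as <out_dir>/<ID>.html.

    Returns the list of IDs that could not be downloaded.
    (Hypothetical helper; the name and the file naming are assumptions.)
    """
    out_path = pathlib.Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    failed = []
    for page_id in ids:
        url = base_url + str(page_id)
        try:
            with urllib.request.urlopen(url) as response:
                data = response.read()  # raw bytes of the HTML document
        except OSError as exc:
            # urllib's URLError and HTTPError are subclasses of OSError.
            print(f"Could not fetch {url}: {exc}")
            failed.append(page_id)
            continue
        # Write the bytes unchanged so the page's own encoding is preserved.
        (out_path / f"{page_id}.html").write_bytes(data)
    return failed
```

You would call it with the base URL including the scheme and trailing slash (something like http://www.xyz.com/somestuff/), your list of 75 IDs, and the target folder name. Anything it returns is an ID you may want to retry.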
