
Scrapy recursive link crawler

It starts with a URL on the web (e.g. http://python.org), fetches the web page corresponding to that URL, and parses all the links on that page into a repository of links. Next, it fetches the content of each URL in the repository, parses the links from this new content into the repository, and continues this process for every link in the repository until it is stopped or until a given number of links have been fetched.

How can I do that using Python and Scrapy? I am able to scrape all the links on a single web page, but how do I follow them recursively in depth?
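Since the question asks for Scrapy specifically, here is a minimal sketch of how such a depth-limited recursive crawl is usually written with a CrawlSpider; the spider name, seed URL, and depth limit below are illustrative assumptions, not something given in the question:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor


    class RecursiveLinkSpider(CrawlSpider):
        """Follow every link found, up to a configurable depth."""
        name = "recursive_links"                 # hypothetical spider name
        start_urls = ["http://python.org"]       # seed URL from the question
        custom_settings = {"DEPTH_LIMIT": 3}     # assumed depth cap; 0 means unlimited

        # One rule: extract every link on the page, follow it,
        # and hand the fetched response to parse_item.
        rules = (
            Rule(LinkExtractor(), callback="parse_item", follow=True),
        )

        def parse_item(self, response):
            # Record each page reached; Scrapy's duplicate filter
            # keeps the crawl from looping on circular references.
            yield {"url": response.url}

Run it with something like scrapy runspider spider.py -o links.json; the built-in duplicate filter and the DEPTH_LIMIT setting take care of the "repository of links" bookkeeping described above.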

Answer1:

Several remarks:

- You don't need Scrapy for such a simple task. urllib (or Requests) and an HTML parser (Beautiful Soup, etc.) can do the job.
- I don't recall where I heard it, but I think it's better to crawl using a BFS algorithm: you can easily avoid circular references. A rough BFS sketch follows these remarks.
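To make the BFS remark concrete, here is a rough Python 3 sketch (it uses bs4 rather than the old BeautifulSoup module that the answer's own code below relies on; the seed URL and page limit are placeholder assumptions):

    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen

    from bs4 import BeautifulSoup


    def bfs_crawl(seed="http://python.org", max_pages=10):
        """Breadth-first crawl: visit pages level by level, never the same URL twice."""
        queue = deque([seed])
        visited = set()
        while queue and len(visited) < max_pages:
            url = queue.popleft()
            if url in visited:
                continue
            visited.add(url)
            try:
                html = urlopen(url, timeout=5).read()
            except Exception:
                continue  # skip pages that fail to load
            soup = BeautifulSoup(html, "html.parser")
            for a in soup.find_all("a", href=True):
                link = urljoin(url, a["href"])   # resolve relative links
                if link.startswith("http") and link not in visited:
                    queue.append(link)
        return visited

The visited set is what breaks circular references: a URL that has already been seen is never queued or fetched again.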

Below is a simple implementation: it does not fetch internal links (only fully formed absolute hyperlinks), it has no error handling (403, 404, no links, ...), and it is abysmally slow (the multiprocessing module can help a lot here; a hedged sketch of that idea follows the output below).

    import BeautifulSoup
    import urllib2
    import itertools
    import random


    class Crawler(object):
        """docstring for Crawler"""

        def __init__(self):
            self.soup = None                              # Beautiful Soup object
            self.current_page = "http://www.python.org/"  # Current page's address
            self.links = set()                            # Queue with every links fetched
            self.visited_links = set()
            self.counter = 0                              # Simple counter for debug purpose

        def open(self):
            # Open url
            print self.counter, ":", self.current_page
            res = urllib2.urlopen(self.current_page)
            html_code = res.read()
            self.visited_links.add(self.current_page)

            # Fetch every links
            self.soup = BeautifulSoup.BeautifulSoup(html_code)
            page_links = []
            try:
                page_links = itertools.ifilter(  # Only deal with absolute links
                    lambda href: 'http://' in href,
                    (a.get('href') for a in self.soup.findAll('a')))
            except Exception:  # Magnificent exception handling
                pass

            # Update links
            self.links = self.links.union(set(page_links))

            # Choose a random url from non-visited set
            self.current_page = random.sample(
                self.links.difference(self.visited_links), 1)[0]
            self.counter += 1

        def run(self):
            # Crawl 3 webpages (or stop if all url has been fetched)
            while len(self.visited_links) < 3 or (self.visited_links == self.links):
                self.open()
            for link in self.links:
                print link


    if __name__ == '__main__':
        C = Crawler()
        C.run()

Output:

    In [48]: run BFScrawler.py
    0 : http://www.python.org/
    1 : http://twistedmatrix.com/trac/
    2 : http://www.flowroute.com/
    http://www.egenix.com/files/python/mxODBC.html
    http://wiki.python.org/moin/PyQt
    http://wiki.python.org/moin/DatabaseProgramming/
    http://wiki.python.org/moin/CgiScripts
    http://wiki.python.org/moin/WebProgramming
    http://trac.edgewall.org/
    http://www.facebook.com/flowroute
    http://www.flowroute.com/
    http://www.opensource.org/licenses/mit-license.php
    http://roundup.sourceforge.net/
    http://www.zope.org/
    http://www.linkedin.com/company/flowroute
    http://wiki.python.org/moin/TkInter
    http://pypi.python.org/pypi
    http://pycon.org/#calendar
    http://dyn.com/
    http://www.google.com/calendar/ical/j7gov1cmnqr9tvg14k621j7t5c%40group.calendar.google.com/public/basic.ics
    http://www.pygame.org/news.html
    http://www.turbogears.org/
    http://www.openbookproject.net/pybiblio/
    http://wiki.python.org/moin/IntegratedDevelopmentEnvironments
    http://support.flowroute.com/forums
    http://www.pentangle.net/python/handbook/
    http://dreamhost.com/?q=twisted
    http://www.vrplumber.com/py3d.py
    http://sourceforge.net/projects/mysql-python
    http://wiki.python.org/moin/GuiProgramming
    http://software-carpentry.org/
    http://www.google.com/calendar/ical/3haig2m9msslkpf2tn1h56nn9g%40group.calendar.google.com/public/basic.ics
    http://wiki.python.org/moin/WxPython
    http://wiki.python.org/moin/PythonXml
    http://www.pytennessee.org/
    http://labs.twistedmatrix.com/
    http://www.found.no/
    http://www.prnewswire.com/news-releases/voip-innovator-flowroute-relocates-to-seattle-190011751.html
    http://www.timparkin.co.uk/
    http://docs.python.org/howto/sockets.html
    http://blog.python.org/
    http://docs.python.org/devguide/
    http://www.djangoproject.com/
    http://buildbot.net/trac
    http://docs.python.org/3/
    http://www.prnewswire.com/news-releases/flowroute-joins-voxbones-inum-network-for-global-voip-calling-197319371.html
    http://www.psfmember.org
    http://docs.python.org/2/
    http://wiki.python.org/moin/Languages
    http://sip-trunking.tmcnet.com/topics/enterprise-voip/articles/341902-grandstream-ip-voice-solutions-receive-flowroute-certification.htm
    http://www.twitter.com/flowroute
    http://wiki.python.org/moin/NumericAndScientific
    http://www.google.com/calendar/ical/b6v58qvojllt0i6ql654r1vh00%40group.calendar.google.com/public/basic.ics
    http://freecode.com/projects/pykyra
    http://www.xs4all.com/
    http://blog.flowroute.com
    http://wiki.python.org/moin/PyGtk
    http://twistedmatrix.com/trac/
    http://wiki.python.org/moin/
    http://wiki.python.org/moin/Python2orPython3
    http://stackoverflow.com/questions/tagged/twisted
    http://www.pycon.org/
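Regarding the speed remark above, one hedged way to parallelise the downloads with the multiprocessing module (its thread pool, multiprocessing.dummy, fits this I/O-bound task; the URL list and pool size here are placeholders, not part of the original answer):

    from multiprocessing.dummy import Pool  # thread pool shipped with multiprocessing
    from urllib.request import urlopen


    def fetch(url):
        """Download one page; return (url, html) or (url, None) on failure."""
        try:
            return url, urlopen(url, timeout=5).read()
        except Exception:
            return url, None


    urls = ["http://python.org", "http://wiki.python.org/moin/"]  # whatever is currently queued
    with Pool(8) as pool:                          # 8 downloads in flight at once
        for url, html in pool.imap_unordered(fetch, urls):
            if html is not None:
                print(url, len(html), "bytes")

The crawl logic stays the same; only the fetch step is fanned out across workers.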

Answer2:

Here is the main crawl method, written to scrape links recursively from a web page. This method crawls a URL and puts all the crawled URLs into a buffer; multiple threads then pop URLs from this global buffer and call the same crawl method on each of them. A rough sketch of such a worker loop follows the code.

    def crawl(self, urlObj):
        '''Main function to crawl URL's '''
        try:
            if ((urlObj.valid) and (urlObj.url not in CRAWLED_URLS.keys())):
                rsp = urlcon.urlopen(urlObj.url, timeout=2)
                hCode = rsp.read()
                soup = BeautifulSoup(hCode)
                links = self.scrap(soup)
                boolStatus = self.checkmax()
                if boolStatus:
                    CRAWLED_URLS.setdefault(urlObj.url, "True")
                else:
                    return
                for eachLink in links:
                    if eachLink not in VISITED_URLS:
                        parsedURL = urlparse(eachLink)
                        if parsedURL.scheme and "javascript" in parsedURL.scheme:
                            #print("***************Javascript found in scheme " + str(eachLink) + "**************")
                            continue
                        '''Handle internal URLs '''
                        try:
                            if not parsedURL.scheme and not parsedURL.netloc:
                                #print("No scheme and host found for " + str(eachLink))
                                newURL = urlunparse(parsedURL._replace(**{"scheme": urlObj.scheme, "netloc": urlObj.netloc}))
                                eachLink = newURL
                            elif not parsedURL.scheme:
                                #print("Scheme not found for " + str(eachLink))
                                newURL = urlunparse(parsedURL._replace(**{"scheme": urlObj.scheme}))
                                eachLink = newURL
                            if eachLink not in VISITED_URLS:  # Check again for internal URL's
                                #print(" Found child link " + eachLink)
                                CRAWL_BUFFER.append(eachLink)
                                with self._lock:
                                    self.count += 1
                                    #print(" Count is =================> " + str(self.count))
                                boolStatus = self.checkmax()
                                if boolStatus:
                                    VISITED_URLS.setdefault(eachLink, "True")
                                else:
                                    return
                        except TypeError:
                            print("Type error occured ")
            else:
                print("URL already present in visited " + str(urlObj.url))
        except socket.timeout as e:
            print("**************** Socket timeout occured*******************")
        except URLError as e:
            if isinstance(e.reason, ConnectionRefusedError):
                print("**************** Conn refused error occured*******************")
            elif isinstance(e.reason, socket.timeout):
                print("**************** Socket timed out error occured***************")
            elif isinstance(e.reason, OSError):
                print("**************** OS error occured*************")
            elif isinstance(e, HTTPError):
                print("**************** HTTP Error occured*************")
            else:
                print("**************** URL Error occured***************")
        except Exception as e:
            print("Unknown exception occured while fetching HTML code" + str(e))
            traceback.print_exc()
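The crawl method above relies on globals (CRAWL_BUFFER, VISITED_URLS, CRAWLED_URLS) and helpers (self.scrap, self.checkmax, urlcon) defined in the linked repository. As a rough, hedged sketch of the thread-and-buffer arrangement the answer describes, not the repository's actual code (the queue, sentinel value, and thread count are assumptions):

    import threading
    import queue

    CRAWL_BUFFER = queue.Queue()         # thread-safe buffer of URLs still to crawl
    VISITED_URLS = set()
    VISITED_LOCK = threading.Lock()


    def worker(crawl):
        """Pop URLs from the shared buffer and crawl each one."""
        while True:
            url = CRAWL_BUFFER.get()     # blocks until a URL is available
            if url is None:              # sentinel tells this worker to stop
                break
            with VISITED_LOCK:
                if url in VISITED_URLS:
                    CRAWL_BUFFER.task_done()
                    continue
                VISITED_URLS.add(url)
            crawl(url)                   # crawl() is expected to push new links onto CRAWL_BUFFER
            CRAWL_BUFFER.task_done()


    def run(crawl, seed, num_threads=4):
        CRAWL_BUFFER.put(seed)
        threads = [threading.Thread(target=worker, args=(crawl,)) for _ in range(num_threads)]
        for t in threads:
            t.start()
        CRAWL_BUFFER.join()              # wait until every queued URL has been processed
        for _ in threads:
            CRAWL_BUFFER.put(None)       # one sentinel per worker
        for t in threads:
            t.join()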

The complete source code and instructions are available at https://github.com/tarunbansal/crawler
