Scrapy recursive link crawler

The crawler starts with a URL on the web (e.g. http://python.org), fetches the web page at that URL, and parses all the links on that page into a repository of links. Next, it fetches the contents of any URL from the repository just created, parses the links from this new content into the repository, and continues this process for all links in the repository until it is stopped or a given number of links have been fetched.

How can I do that using Python and Scrapy? I am able to scrape all the links on a web page, but how do I do it recursively, in depth?


Several remarks:

- You don't need Scrapy for such a simple task. urllib (or Requests) and an HTML parser (Beautiful Soup, etc.) can do the job.
- I don't recall where I heard it, but I think it's better to crawl using a BFS algorithm. You can easily avoid circular references that way.
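To make the BFS remark concrete, here is a minimal Python 3 sketch, not a full crawler. It assumes a hypothetical `get_links` callable that returns the links found on a page; a toy in-memory graph with a cycle stands in for real pages so the example runs without network access. The names `bfs_crawl`, `get_links`, and `GRAPH` are made up for illustration.

```python
from collections import deque


def bfs_crawl(start_url, get_links, max_pages=10):
    """Visit pages breadth-first and return them in visiting order.

    get_links is a hypothetical callable: url -> iterable of linked urls.
    """
    queued = {start_url}        # every URL ever queued; this is what blocks cycles
    queue = deque([start_url])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()   # FIFO order is what makes the traversal breadth-first
        order.append(url)
        for link in get_links(url):
            if link not in queued:
                queued.add(link)
                queue.append(link)
    return order


# Toy link graph with a cycle (a -> b -> a) standing in for real pages.
GRAPH = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": [],
}

print(bfs_crawl("a", lambda url: GRAPH.get(url, [])))
```

Because a URL is recorded in `queued` the moment it is first seen, the `b -> a` back-link is ignored and each page is visited at most once.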

Below is a simple implementation: it does not fetch internal links (only absolute hyperlinks), has no error handling for HTTP failures (403, 404, ...), and is abysmally slow (the multiprocessing module can help a lot in this case).

    # Python 2 code: relies on urllib2 and BeautifulSoup 3.
    import BeautifulSoup
    import urllib2
    import random


    class Crawler(object):
        """Minimal crawler that follows absolute links picked at random."""

        def __init__(self):
            self.soup = None                              # Beautiful Soup object
            self.current_page = "http://www.python.org/"  # Current page's address
            self.links = set()                            # Every link fetched so far
            self.visited_links = set()
            self.counter = 0                              # Simple counter for debug purposes

        def open(self):
            # Open url
            print self.counter, ":", self.current_page
            res = urllib2.urlopen(self.current_page)
            html_code = res.read()
            self.visited_links.add(self.current_page)

            # Fetch every link
            self.soup = BeautifulSoup.BeautifulSoup(html_code)

            # Only deal with absolute links; skip <a> tags that have no href
            page_links = [
                a.get('href') for a in self.soup.findAll('a')
                if a.get('href') and 'http://' in a.get('href')
            ]

            # Update links
            self.links = self.links.union(set(page_links))

            # Choose a random url from the non-visited set (None if nothing is left)
            unvisited = self.links.difference(self.visited_links)
            self.current_page = random.sample(unvisited, 1)[0] if unvisited else None
            self.counter += 1

        def run(self):
            # Crawl 3 webpages (or stop once every url has been fetched)
            while len(self.visited_links) < 3 and self.current_page is not None:
                self.open()

            for link in self.links:
                print link


    if __name__ == '__main__':
        C = Crawler()
        C.run()


    In [48]: run BFScrawler.py
    0 : http://www.python.org/
    1 : http://twistedmatrix.com/trac/
    2 : http://www.flowroute.com/
    http://www.egenix.com/files/python/mxODBC.html
    http://wiki.python.org/moin/PyQt
    http://wiki.python.org/moin/DatabaseProgramming/
    http://wiki.python.org/moin/CgiScripts
    http://wiki.python.org/moin/WebProgramming
    http://trac.edgewall.org/
    http://www.facebook.com/flowroute
    http://www.flowroute.com/
    http://www.opensource.org/licenses/mit-license.php
    http://roundup.sourceforge.net/
    http://www.zope.org/
    http://www.linkedin.com/company/flowroute
    http://wiki.python.org/moin/TkInter
    http://pypi.python.org/pypi
    http://pycon.org/#calendar
    http://dyn.com/
    http://www.google.com/calendar/ical/j7gov1cmnqr9tvg14k621j7t5c%40group.calendar.google.com/public/basic.ics
    http://www.pygame.org/news.html
    http://www.turbogears.org/
    http://www.openbookproject.net/pybiblio/
    http://wiki.python.org/moin/IntegratedDevelopmentEnvironments
    http://support.flowroute.com/forums
    http://www.pentangle.net/python/handbook/
    http://dreamhost.com/?q=twisted
    http://www.vrplumber.com/py3d.py
    http://sourceforge.net/projects/mysql-python
    http://wiki.python.org/moin/GuiProgramming
    http://software-carpentry.org/
    http://www.google.com/calendar/ical/3haig2m9msslkpf2tn1h56nn9g%40group.calendar.google.com/public/basic.ics
    http://wiki.python.org/moin/WxPython
    http://wiki.python.org/moin/PythonXml
    http://www.pytennessee.org/
    http://labs.twistedmatrix.com/
    http://www.found.no/
    http://www.prnewswire.com/news-releases/voip-innovator-flowroute-relocates-to-seattle-190011751.html
    http://www.timparkin.co.uk/
    http://docs.python.org/howto/sockets.html
    http://blog.python.org/
    http://docs.python.org/devguide/
    http://www.djangoproject.com/
    http://buildbot.net/trac
    http://docs.python.org/3/
    http://www.prnewswire.com/news-releases/flowroute-joins-voxbones-inum-network-for-global-voip-calling-197319371.html
    http://www.psfmember.org
    http://docs.python.org/2/
    http://wiki.python.org/moin/Languages
    http://sip-trunking.tmcnet.com/topics/enterprise-voip/articles/341902-grandstream-ip-voice-solutions-receive-flowroute-certification.htm
    http://www.twitter.com/flowroute
    http://wiki.python.org/moin/NumericAndScientific
    http://www.google.com/calendar/ical/b6v58qvojllt0i6ql654r1vh00%40group.calendar.google.com/public/basic.ics
    http://freecode.com/projects/pykyra
    http://www.xs4all.com/
    http://blog.flowroute.com
    http://wiki.python.org/moin/PyGtk
    http://twistedmatrix.com/trac/
    http://wiki.python.org/moin/
    http://wiki.python.org/moin/Python2orPython3
    http://stackoverflow.com/questions/tagged/twisted
    http://www.pycon.org/
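The implementation above skips relative links and does not handle failed requests. Here is a minimal Python 3 sketch of how both gaps might be closed using only the standard library. It swaps Beautiful Soup for `html.parser` purely to keep the example self-contained, and `LinkParser`, `extract_links`, and `fetch` are hypothetical names that do not appear in the code above.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the raw href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:                      # ignore <a> tags without an href
                self.hrefs.append(href)


def extract_links(html, base_url):
    """Return the absolute http(s) links in html, resolved against base_url."""
    parser = LinkParser()
    parser.feed(html)
    links = set()
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)        # resolves "/about", "docs/x.html", ...
        if urlparse(absolute).scheme in ("http", "https"):
            links.add(urldefrag(absolute)[0])     # drop the "#fragment" part
    return links


def fetch(url, timeout=5):
    """Return the page body as text, or None if the request fails."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except OSError:
        # URLError, HTTPError (403, 404, ...) and socket timeouts are all
        # subclasses of OSError in Python 3.
        return None


SAMPLE_HTML = (
    '<a href="/about">About</a>'
    '<a href="docs/index.html">Docs</a>'
    '<a href="http://example.org/x#top">External</a>'
    '<a href="javascript:void(0)">Script</a>'
    '<a>No href</a>'
)

for link in sorted(extract_links(SAMPLE_HTML, "http://www.python.org/community/")):
    print(link)
```

A crawler loop would then call `fetch(url)`, skip the page when it returns `None`, and pass the HTML to `extract_links(html, url)` so that relative links are resolved against the page they came from.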


Here is the main crawl method, written to scrape links recursively from a web page. It crawls a URL and puts all the URLs it finds into a buffer. Multiple threads wait to pop URLs from this global buffer and call the crawl method again on each one.

    # Method of the crawler class. CRAWLED_URLS, VISITED_URLS, CRAWL_BUFFER and
    # urlcon are module-level names defined in the complete source linked below.
    def crawl(self, urlObj):
        '''Main function to crawl URLs'''
        try:
            if urlObj.valid and urlObj.url not in CRAWLED_URLS:
                rsp = urlcon.urlopen(urlObj.url, timeout=2)
                hCode = rsp.read()
                soup = BeautifulSoup(hCode)
                links = self.scrap(soup)
                # Stop once the maximum number of URLs has been reached
                boolStatus = self.checkmax()
                if boolStatus:
                    CRAWLED_URLS.setdefault(urlObj.url, "True")
                else:
                    return
                for eachLink in links:
                    if eachLink not in VISITED_URLS:
                        parsedURL = urlparse(eachLink)
                        # Skip javascript: pseudo-links
                        if parsedURL.scheme and "javascript" in parsedURL.scheme:
                            continue
                        # Handle internal URLs
                        try:
                            if not parsedURL.scheme and not parsedURL.netloc:
                                # Neither scheme nor host: borrow both from the parent URL
                                newURL = urlunparse(parsedURL._replace(
                                    scheme=urlObj.scheme, netloc=urlObj.netloc))
                                eachLink = newURL
                            elif not parsedURL.scheme:
                                # Host but no scheme: borrow the parent's scheme
                                newURL = urlunparse(parsedURL._replace(scheme=urlObj.scheme))
                                eachLink = newURL
                            if eachLink not in VISITED_URLS:  # Check again for internal URLs
                                CRAWL_BUFFER.append(eachLink)
                                with self._lock:
                                    self.count += 1
                                boolStatus = self.checkmax()
                                if boolStatus:
                                    VISITED_URLS.setdefault(eachLink, "True")
                                else:
                                    return
                        except TypeError:
                            print("Type error occurred")
            else:
                print("URL already present in visited " + str(urlObj.url))
        except socket.timeout as e:
            print("**************** Socket timeout occurred*******************")
        except URLError as e:
            if isinstance(e.reason, ConnectionRefusedError):
                print("**************** Conn refused error occurred*******************")
            elif isinstance(e.reason, socket.timeout):
                print("**************** Socket timed out error occurred***************")
            elif isinstance(e.reason, OSError):
                print("**************** OS error occurred*************")
            elif isinstance(e, HTTPError):
                print("**************** HTTP Error occurred*************")
            else:
                print("**************** URL Error occurred***************")
        except Exception as e:
            print("Unknown exception occurred while fetching HTML code" + str(e))
            traceback.print_exc()

The complete source code and instructions are available at https://github.com/tarunbansal/crawler
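The arrangement described above, where several threads pop URLs from a shared buffer, might be wired up roughly as follows. This is a simplified Python 3 sketch, not the code from the linked repository: it uses `queue.Queue` in place of the global buffer, and `crawl_concurrently`, `get_links`, and `GRAPH` are hypothetical names. `get_links` stands in for fetching and parsing a page and is assumed not to raise.

```python
import queue
import threading


def crawl_concurrently(start_url, get_links, max_pages=10, num_workers=4):
    """Crawl with worker threads that share one URL buffer.

    get_links is a hypothetical callable: url -> iterable of linked urls.
    """
    buffer = queue.Queue()       # thread-safe buffer of URLs waiting to be crawled
    seen = {start_url}           # every URL ever put in the buffer
    crawled = []                 # URLs actually processed
    lock = threading.Lock()      # guards seen and crawled
    buffer.put(start_url)

    def worker():
        while True:
            url = buffer.get()
            if url is None:      # sentinel: no more work
                buffer.task_done()
                return
            try:
                with lock:
                    if len(crawled) >= max_pages:
                        continue  # limit reached: drain the buffer without crawling
                    crawled.append(url)
                for link in get_links(url):
                    with lock:
                        if link in seen:
                            continue
                        seen.add(link)
                    buffer.put(link)
            finally:
                buffer.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    buffer.join()                # returns once every queued URL has been handled
    for _ in threads:
        buffer.put(None)         # one sentinel per worker
    for t in threads:
        t.join()
    return crawled


# Toy link graph with a cycle, standing in for real pages.
GRAPH = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d"], "d": []}

print(sorted(crawl_concurrently("a", lambda url: GRAPH.get(url, []))))
```

The order in which pages are crawled depends on thread scheduling, so only the set of crawled URLs is predictable, not their sequence.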


