
Scrapy - offsite request to be processed based on a regex

Question:

I have to crawl 5-6 domains. I want to write the crawler so that an offsite request is still processed, rather than filtered out, if its URL contains one of a set of substrings, for example [aaa, bbb, ccc]. Should I write a custom middleware, or can I just use a regular expression in allowed_domains?

Answer1:

The offsite middleware already uses a regex by default, but it's not exposed. It compiles the domains you provide into a regex, but the domains are escaped, so putting regex syntax in allowed_domains would not work.

What you can do, though, is extend that middleware and override its get_host_regex() method to implement your own offsite policy.

The original code in scrapy.spidermiddlewares.offsite.OffsiteMiddleware:

    def get_host_regex(self, spider):
        """Override this method to implement a different offsite policy"""
        allowed_domains = getattr(spider, 'allowed_domains', None)
        if not allowed_domains:
            return re.compile('')  # allow all by default
        regex = r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains if d is not None)
        return re.compile(regex)
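To see why regex syntax in allowed_domains has no effect, here is a small standalone sketch that copies the pattern construction from the method above into a plain function. The function name build_default_host_regex and the sample hostnames are made up for illustration; nothing here imports Scrapy.

```python
import re


def build_default_host_regex(allowed_domains):
    """Standalone copy of the pattern built by the quoted get_host_regex()."""
    if not allowed_domains:
        return re.compile('')  # empty pattern matches every host
    pattern = r'^(.*\.)?(%s)$' % '|'.join(
        re.escape(d) for d in allowed_domains if d is not None
    )
    return re.compile(pattern)


# A normal domain matches itself and its subdomains only.
regex = build_default_host_regex(['example.com'])
print(bool(regex.search('example.com')))        # True
print(bool(regex.search('blog.example.com')))   # True
print(bool(regex.search('notexample.com')))     # False

# Regex syntax is escaped, so it is treated as a literal domain name.
escaped = build_default_host_regex([r'.*aaa.*'])
print(bool(escaped.search('www.aaa-shop.com')))  # False
print(bool(escaped.search('.*aaa.*')))           # True
```

The last two lines show the escaping at work: the pattern only matches a host literally named ".*aaa.*", which is why a custom get_host_regex() is needed.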

You can override it to return your own regex:

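    # middlewares.py
    import re
    from scrapy.spidermiddlewares.offsite import OffsiteMiddleware

    class MyOffsiteMiddleware(OffsiteMiddleware):
        def get_host_regex(self, spider):
            # An empty pattern matches every host, so nothing is filtered by default.
            allowed_regex = getattr(spider, 'allowed_regex', '')
            return re.compile(allowed_regex)

    # spiders/myspider.py
    import scrapy

    class MySpider(scrapy.Spider):
        name = 'myspider'
        allowed_regex = r'.+?\.com'

    # settings.py
    # This OffsiteMiddleware is a spider middleware, so register the subclass
    # under SPIDER_MIDDLEWARES and disable the built-in one.
    SPIDER_MIDDLEWARES = {
        'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
        'myproject.middlewares.MyOffsiteMiddleware': 666,
    }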
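For the case in the question (a few allowed domains plus a list of substrings such as aaa, bbb, ccc), the regex returned from get_host_regex() could be assembled along these lines. This is only a sketch: build_host_regex and its two parameters are hypothetical names, and it assumes the middleware applies the compiled regex with search() to the request's hostname, so the substrings are matched against the host and not against the path or query string.

```python
import re


def build_host_regex(allowed_domains, allowed_substrings):
    """Match hosts that belong to allowed_domains OR contain any substring.

    Hypothetical helper; the result is meant to be returned from an
    overridden get_host_regex().
    """
    alternatives = []
    if allowed_domains:
        domains = '|'.join(re.escape(d) for d in allowed_domains)
        # Same shape as the default policy: the domain itself or a subdomain.
        alternatives.append(r'^(.*\.)?(%s)$' % domains)
    if allowed_substrings:
        # Unanchored, so a match anywhere in the host is enough.
        alternatives.append('|'.join(re.escape(s) for s in allowed_substrings))
    # With both lists empty this compiles '', which allows every host.
    return re.compile('|'.join('(?:%s)' % a for a in alternatives))


host_regex = build_host_regex(['example.com'], ['aaa', 'bbb', 'ccc'])
print(bool(host_regex.search('shop.example.com')))   # True
print(bool(host_regex.search('www.aaa-store.net')))  # True
print(bool(host_regex.search('unrelated.org')))      # False
```

Escaping the substrings keeps characters like "." from being read as regex syntax. If the substrings need to be matched against the full URL instead of the host, the filtering check itself would have to be overridden, not just the regex.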
