Scrapy - offsite request to be processed based on a regex


I have to crawl 5-6 domains. I want to write the crawler so that an offsite request is still processed, rather than filtered out, if its URL contains one of a set of substrings, for example [aaa, bbb, ccc]. Should I write a custom middleware, or can I just use a regular expression in allowed_domains?


The offsite middleware already uses a regex internally, but it is not exposed. It compiles the domains you provide into a single regex, and each domain is escaped first, so putting regex syntax in allowed_domains will not work.

What you can do, though, is extend that middleware and override its get_host_regex() method to implement your own offsite policy.

The original code in scrapy.spidermiddlewares.offsite.OffsiteMiddleware:

    def get_host_regex(self, spider):
        """Override this method to implement a different offsite policy"""
        allowed_domains = getattr(spider, 'allowed_domains', None)
        if not allowed_domains:
            return re.compile('')  # allow all by default
        regex = r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains if d is not None)
        return re.compile(regex)
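To see why regex syntax in allowed_domains has no effect, here is a minimal sketch that rebuilds the same pattern outside Scrapy. The helper name build_host_regex and the sample hostnames are made up for illustration; only the pattern construction is taken from the code quoted above.

```python
import re


def build_host_regex(allowed_domains):
    """Hypothetical stand-in for get_host_regex(), using the same pattern."""
    if not allowed_domains:
        return re.compile('')  # empty pattern matches every host
    pattern = r'^(.*\.)?(%s)$' % '|'.join(
        re.escape(d) for d in allowed_domains if d is not None
    )
    return re.compile(pattern)


host_regex = build_host_regex(['example.com'])

# The domain itself and any of its subdomains are treated as on-site.
on_site = bool(host_regex.search('blog.example.com'))

# A different domain is treated as offsite.
off_site = bool(host_regex.search('example.org'))

# Regex syntax passed as a "domain" is escaped, so it is matched literally
# and never behaves like a pattern.
escaped_regex = build_host_regex([r'.+?\.com'])
pattern_ignored = not escaped_regex.search('anything.com')
```

Because of the re.escape() call, the only host the last pattern could match is the literal text ".+?\.com", which is why overriding the method is needed.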

You can override it to return your own regex:

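    # middlewares.py
    import re
    from scrapy.spidermiddlewares.offsite import OffsiteMiddleware

    class MyOffsiteMiddleware(OffsiteMiddleware):
        def get_host_regex(self, spider):
            # An empty pattern matches every host, so nothing is filtered by default.
            allowed_regex = getattr(spider, 'allowed_regex', '')
            return re.compile(allowed_regex)

    # spiders/myspider.py
    import scrapy

    class MySpider(scrapy.Spider):
        name = 'myspider'
        allowed_regex = r'.+?\.com'

    # settings.py
    # OffsiteMiddleware is a spider middleware, so register the subclass under
    # SPIDER_MIDDLEWARES and disable the built-in one.
    SPIDER_MIDDLEWARES = {
        'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
        'myproject.middlewares.MyOffsiteMiddleware': 666,
    }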
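For the substring case from the question, a value for allowed_regex could be built along these lines. This is only a sketch: build_substring_regex and host_allowed are hypothetical helpers, the URLs are invented, and it assumes the middleware tests the pattern against the request's hostname rather than the full URL, so a substring that appears only in the path would not count.

```python
import re
from urllib.parse import urlparse


def build_substring_regex(substrings):
    """Hypothetical helper: match any host containing one of the substrings."""
    # Each substring is escaped so it is matched literally.
    return re.compile('|'.join(re.escape(s) for s in substrings))


# Example set taken from the question.
ALLOWED_SUBSTRINGS = ['aaa', 'bbb', 'ccc']
substring_regex = build_substring_regex(ALLOWED_SUBSTRINGS)


def host_allowed(url):
    """Hypothetical check, assuming the pattern is searched in the hostname."""
    host = urlparse(url).hostname or ''
    return bool(substring_regex.search(host))


in_host = host_allowed('http://shop.aaa-store.com/item')
only_in_path = host_allowed('http://example.com/aaa')
```

The string substring_regex.pattern could then be assigned to allowed_regex on the spider. If the substrings need to be matched against the whole URL instead, the filtering logic itself would have to be overridden, not just the regex.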

