Scrapy - offsite request to be processed based on a regex


I have to crawl 5-6 domains. I want to write the crawler so that an offsite request is still processed if its URL contains one of a set of substrings, for example [aaa, bbb, ccc]: if the offsite URL contains any substring from that set, it should be processed and not filtered out. Should I write a custom middleware, or can I just use regular expressions in allowed_domains?


The offsite middleware already uses a regex internally, but it's not exposed. It compiles the domains you provide into a regex, but the domains are escaped first, so putting regex syntax in allowed_domains will not work.

What you can do, though, is extend that middleware and override its get_host_regex() method to implement your own offsite policy.

The original code in scrapy.spidermiddlewares.offsite.OffsiteMiddleware:

    def get_host_regex(self, spider):
        """Override this method to implement a different offsite policy"""
        allowed_domains = getattr(spider, 'allowed_domains', None)
        if not allowed_domains:
            return re.compile('')  # allow all by default
        regex = r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains if d is not None)
        return re.compile(regex)
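To see why regex syntax in allowed_domains has no effect, the same pattern construction can be run on its own. This is only a sketch: build_host_regex is a hypothetical standalone helper that copies the expression from the method above, and the hostnames are made up.

```python
import re

def build_host_regex(allowed_domains):
    # Same pattern construction as get_host_regex, lifted out for illustration
    pattern = r'^(.*\.)?(%s)$' % '|'.join(
        re.escape(d) for d in allowed_domains if d is not None
    )
    return re.compile(pattern)

# A plain domain matches itself and its subdomains
plain = build_host_regex(['example.com'])
plain_hit = bool(plain.search('sub.example.com'))

# Regex syntax is escaped, so it is treated as a literal hostname
attempted = build_host_regex([r'.+?\.com'])
attempted_hit = bool(attempted.search('example.com'))
literal_hit = bool(attempted.search(r'.+?\.com'))
```

Here plain_hit is True, attempted_hit is False, and literal_hit is True: after re.escape, the entry only matches a host that is literally spelled ".+?\.com".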

You can override it to return your own regex:

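    # middlewares.py
    import re

    from scrapy.spidermiddlewares.offsite import OffsiteMiddleware

    class MyOffsiteMiddleware(OffsiteMiddleware):
        def get_host_regex(self, spider):
            # The compiled regex is matched against each request's hostname
            allowed_regex = getattr(spider, 'allowed_regex', '')
            return re.compile(allowed_regex)

    # spiders/myspider.py
    import scrapy

    class MySpider(scrapy.Spider):
        allowed_regex = r'.+?\.com'

    # settings.py
    # OffsiteMiddleware here is a spider middleware, so register the subclass
    # under SPIDER_MIDDLEWARES and disable the built-in one.
    SPIDER_MIDDLEWARES = {
        'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
        'myproject.middlewares.MyOffsiteMiddleware': 666,
    }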
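For the substring case from the question, one possible approach is to keep the usual domain rule and add an alternative that accepts any host containing one of the substrings. The sketch below uses only the re module; build_offsite_regex and the allowed_substrings name are hypothetical, not part of Scrapy. Note that the middleware applies this regex to the request's hostname, so substrings that appear only in the URL path would not be seen.

```python
import re

def build_offsite_regex(allowed_domains, allowed_substrings):
    """Compile a host regex: normal domain match OR host contains a substring."""
    alternatives = []

    # Same rule as the stock middleware: the domain itself or any subdomain
    domains = [re.escape(d) for d in allowed_domains if d is not None]
    if domains:
        alternatives.append(r'^(.*\.)?(%s)$' % '|'.join(domains))

    # Extra rule: any host that contains one of the given substrings
    substrings = [re.escape(s) for s in allowed_substrings if s]
    if substrings:
        alternatives.append('(%s)' % '|'.join(substrings))

    # With no rules at all this compiles '', which matches every host
    return re.compile('|'.join(alternatives))

host_regex = build_offsite_regex(['example.com'], ['aaa', 'bbb', 'ccc'])
```

In a subclass such as MyOffsiteMiddleware, get_host_regex could return build_offsite_regex(getattr(spider, 'allowed_domains', None) or [], getattr(spider, 'allowed_substrings', [])), assuming the spider defines an allowed_substrings attribute.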


