Links with space before and after are not parsed correctly

Question:

I'm crawling a website whose links have whitespace before and after the URL in the href attribute:

<a href=" /c/96894 ">Test</a>

Instead of crawling this:

http://www.store.com/c/96894/

it crawls this:

http://www.store.com/c/%0A%0A/c/96894%0A%0A

Moreover, it causes an infinite loop on pages that link to themselves, producing URLs like this:

http://www.store.com/cp/%0A%0A/cp/96894%0A%0A/cp/96894%0A%0A

Any whitespace (\r, \n, \t and space) before and after the URL is ignored by all browsers. How do I trim the whitespace from the crawled URLs?

Here's my code.

    from scrapy.selector import HtmlXPathSelector
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from wallspider.items import Website


    class StoreSpider(CrawlSpider):
        name = "cpages"
        allowed_domains = ["www.store.com"]
        start_urls = ["http://www.store.com"]

        rules = (
            # Follow category links, skip nofollow links, and parse each page.
            Rule(
                SgmlLinkExtractor(
                    allow=('/c/',),
                    deny=('grid=false', 'sort=', 'stores=', r'\|\|', 'page='),
                ),
                callback="parse_items",
                follow=True,
                process_links=lambda links: [link for link in links if not link.nofollow],
            ),
            # Follow every other link that is not denied, without parsing it.
            Rule(SgmlLinkExtractor(
                allow=(),
                deny=('grid=false', 'sort=', 'stores=', r'\|\|', 'page='),
            )),
        )

        def parse_items(self, response):
            hxs = HtmlXPathSelector(response)
            sites = hxs.select('//html')
            items = []
            for site in sites:
                item = Website()
                item['url'] = response.url
                item['referer'] = response.request.headers.get('Referer')
                item['anchor'] = response.meta.get('link_text')
                item['canonical'] = site.xpath('//head/link[@rel="canonical"]/@href').extract()
                item['robots'] = site.select('//meta[@name="robots"]/@content').extract()
                items.append(item)
            return items

Answer1:

I passed process_value=cleanurl to my link extractor, where cleanurl strips the whitespace around each extracted value before the link is built:

    def cleanurl(link_text):
        return link_text.strip("\t\r\n ")

Here is the full code, in case anyone runs into the same problem:

    from scrapy.selector import HtmlXPathSelector
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from wallspider.items import Website


    class storeSpider(CrawlSpider):
        name = "cppages"
        allowed_domains = ["www.store.com"]
        start_urls = ["http://www.store.com"]

        # Strip whitespace and stray quotes around the raw href value.
        def cleanurl(link_text):
            return link_text.strip("\t\r\n '\"")

        rules = (
            Rule(
                SgmlLinkExtractor(
                    allow=('/cp/',),
                    deny=('grid=false', 'sort=', 'stores=', r'\|\|', 'page='),
                    process_value=cleanurl,
                ),
                callback="parse_items",
                follow=True,
                process_links=lambda links: [link for link in links if not link.nofollow],
            ),
            Rule(SgmlLinkExtractor(
                allow=('/cp/', '/browse/'),
                deny=('grid=false', 'sort=', 'stores=', r'\|\|', 'page='),
                process_value=cleanurl,
            )),
        )

        def parse_items(self, response):
            hxs = HtmlXPathSelector(response)
            sites = hxs.select('//html')
            items = []
            for site in sites:
                item = Website()
                item['url'] = response.url
                item['referer'] = response.request.headers.get('Referer')
                item['anchor'] = response.meta.get('link_text')
                item['canonical'] = site.xpath('//head/link[@rel="canonical"]/@href').extract()
                item['robots'] = site.select('//meta[@name="robots"]/@content').extract()
                items.append(item)
            return items
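As a quick check outside Scrapy, the sketch below applies the same cleanurl function to a raw href value and resolves the result with the standard library's urljoin. The base URL and the sample href are assumptions modelled on the question, and the code is written for Python 3; inside the spider, the link extractor calls process_value itself.

```python
from urllib.parse import urljoin


def cleanurl(link_text):
    """Strip whitespace and stray quotes around a raw href value."""
    return link_text.strip("\t\r\n '\"")


# Assumed sample data modelled on the question, not taken from a real page.
base_url = "http://www.store.com/c/"
raw_href = "\n\n/c/96894\n\n"

cleaned = cleanurl(raw_href)
absolute = urljoin(base_url, cleaned)

print(repr(cleaned))  # '/c/96894'
print(absolute)       # http://www.store.com/c/96894
```

Because strip() only touches the two ends of the string, any characters inside the path are left alone.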

Answer2:

You can replace the whitespace with an empty string, like this:

    url = response.url
    item['url'] = url.replace(' ', '')

Or, using a regular expression:

    import re

    url = response.url
    item['url'] = re.sub(r'\s', '', url)
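One caveat is worth checking before relying on this: the URLs shown in the question are already percent-encoded (%0A rather than a literal newline), so by the time they reach response.url there may be no literal whitespace left to remove. The sketch below compares the two calls on a raw href value and on an encoded URL; both sample strings are assumptions based on the question.

```python
import re

# Assumed sample strings modelled on the question.
raw_href = " \n/c/96894\n "
encoded_url = "http://www.store.com/c/%0A%0A/c/96894%0A%0A"

# str.replace(' ', '') removes only literal spaces; newlines survive.
spaces_removed = raw_href.replace(' ', '')

# \s matches spaces, tabs, and newlines anywhere in the string.
all_ws_removed = re.sub(r'\s', '', raw_href)

# A percent-encoded URL has no literal whitespace, so nothing changes.
encoded_after = re.sub(r'\s', '', encoded_url)

print(repr(spaces_removed))          # '\n/c/96894\n'
print(repr(all_ws_removed))          # '/c/96894'
print(encoded_after == encoded_url)  # True
```

Both calls also remove whitespace anywhere in the string, not just at the ends, and they run after the request has already been made, so they would not by themselves stop the malformed links from being followed.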
