
'charmap' codec can't encode character '\xae' While Scraping a Webpage

Question:

I am web scraping with Python using BeautifulSoup, and I get this error when scraping a webpage:

'charmap' codec can't encode character '\xae' in position 69: character maps to <undefined>

This is my Python code:

hotel = BeautifulSoup(state.)
print (hotel.select("div.details.cf span.hotel-name a"))
# Tried: print (hotel.select("div.details.cf span.hotel-name a")).encode('utf-8')

Answer1:

This problem usually occurs when you call .encode() on a byte string that is already encoded. In Python 2, that call first triggers an implicit ASCII decode, which fails on any non-ASCII byte. So try decoding it first, as in:

html = urllib.urlopen(link).read()
unicode_str = html.decode(<source encoding>)
encoded_str = unicode_str.encode("utf8")

As an example:

html = '\xae'
encoded_str = html.encode("utf8")

Fails with

UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)

While:

html = '\xae'
decoded_str = html.decode("windows-1252")
encoded_str = decoded_str.encode("utf8")
print encoded_str
®

Succeeds without error. Note that "windows-1252" is only an example. I got it from chardet, which reported just 0.5 confidence that it is right (not surprising for a one-character string). Replace it with the actual encoding of the byte string returned by .urlopen().read() for the content you retrieved.
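The snippets above are Python 2. If you are on Python 3, the same decode-then-encode idea might look like the sketch below. The helper name reencode_to_utf8 is my own, not part of any library, and "windows-1252" is again only an assumed source encoding for illustration.

```python
def reencode_to_utf8(raw_bytes, source_encoding):
    """Decode raw bytes with the page's encoding, then re-encode as UTF-8.

    Hypothetical helper for illustration; source_encoding must be the
    real encoding of the content you retrieved.
    """
    text = raw_bytes.decode(source_encoding)  # bytes -> str (Unicode text)
    return text.encode("utf-8")               # str -> UTF-8 bytes


# b"\xae" is the registered sign in windows-1252 (assumed encoding).
raw = b"\xae"
utf8_bytes = reencode_to_utf8(raw, "windows-1252")
print(repr(utf8_bytes))  # b'\xc2\xae'
```

In Python 3, bytes objects have no .encode() method at all, so the implicit ASCII decode from the failing example cannot happen; you always decode explicitly first.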
