Illegal character in XML

Question:

I have a PHP file that produces an XML sitemap from data imported from a number of sources. The sitemap is currently not well formed because of an illegal character in one line of the imported data, and I am struggling to remove it.

The character appears to represent the 'squared' symbol (superscript 2) and is displayed as a square. I have tried pasting it into a hex editor, but it is shown as a ?, and the hex code also corresponds to ?. I have also tried using iconv to convert from every source encoding to every destination encoding, and no combination removes the character.

I also have the following function to remove non-ASCII characters:

function stripInvalidXml($value)
{
    $ret = "";
    $current;
    if (empty($value)) {
        return $ret;
    }
    $length = strlen($value);
    for ($i=0; $i < $length; $i++) {
        $current = ord($value{$i});
        if (($current == 0x9) ||
            ($current == 0xA) ||
            ($current == 0xD) ||
            (($current >= 0x20) && ($current <= 0xD7FF)) ||
            (($current >= 0xE000) && ($current <= 0xFFFD)) ||
            (($current >= 0x10000) && ($current <= 0x10FFFF)))
        {
            if ($current != 0x1F) {
                $ret .= chr($current);
            }
        } else {
            $ret .= " ";
        }
    }
    return $ret;
}

However, this still does not remove it. If I step through the code, the illegal character is expanded out to &#65535; in Eclipse's debug window. The string it is having issues with is below (hoping it pastes correctly):

251gm-50

Any ideas for a function that will remove this character and prevent this from occurring are much appreciated. I have little control over the data that is imported, so it needs to be done at the point of XML generation.

EDIT

After posting I can see that the character doesn't appear correctly. In Eclipse's debug window it appears as & # 65535 ; (without the spaces; if I take the spaces out here, the entity is rendered as the character instead of being shown as text).

Answer1:

You are trying to perform character transcoding. Don't do it yourself; use the PHP library.

I found iconv quite useful:

$cleanText = iconv('UTF-8','ISO-8859-1//TRANSLIT//IGNORE', $srcText);

This code translates from UTF-8 to ISO-8859-1, trying to remap the 'exotic' characters and ignoring the ones that cannot be transcoded.

I'm just guessing that the source encoding is UTF-8. You have to discover which encoding the incoming data uses and translate it into the one you declare in the XML header.

A Linux command-line tool that guesses a file's encoding is enca.
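If the encoding cannot be determined up front, one rough approach is to test whether each imported string is already valid UTF-8 and convert it only when it is not. The sketch below assumes the sitemap is declared as UTF-8 and that anything that is not valid UTF-8 is ISO-8859-1; both are assumptions, and `toUtf8()` is a hypothetical helper name, not part of PHP.

```php
<?php
// Hypothetical helper: returns $text as UTF-8. Input that is not already
// valid UTF-8 is ASSUMED to be in $fallbackEncoding (ISO-8859-1 by default).
function toUtf8(string $text, string $fallbackEncoding = 'ISO-8859-1'): string
{
    // With the /u modifier, preg_match() returns false (not 1) when the
    // subject is not valid UTF-8, which makes it a cheap validity check.
    if (preg_match('//u', $text) === 1) {
        return $text;
    }

    $converted = iconv($fallbackEncoding, 'UTF-8', $text);
    if ($converted === false) {
        throw new RuntimeException("Could not convert from $fallbackEncoding");
    }
    return $converted;
}

// "caf\xE9" is "café" in ISO-8859-1; the lone byte E9 is not valid UTF-8.
$fromLatin1 = toUtf8("caf\xE9");
// A string that is already UTF-8 is returned unchanged.
$alreadyUtf8 = toUtf8("caf\u{E9}");
```

This only distinguishes "valid UTF-8" from "everything else", so it will mislabel data that arrives in some third encoding.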

Answer2:

This is wrong:

$current = ord($value{$i});
if (($current == 0x9) ||
    ($current == 0xA) ||
    ($current == 0xD) ||
    (($current >= 0x20) && ($current <= 0xD7FF)) ||
    (($current >= 0xE000) && ($current <= 0xFFFD)) ||
    (($current >= 0x10000) && ($current <= 0x10FFFF)))
{
    if ($current != 0x1F)
        $ret .= chr($current);
}

ord() never returns anything bigger than 0xFF since it works in a byte-by-byte manner.
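To make that concrete, here is a minimal sketch of a code-point-aware version. It assumes the input is already valid UTF-8 and uses a Unicode-mode regular expression (the `/u` modifier) with the same ranges as the function in the question. `stripInvalidXmlChars()` is a hypothetical name, not something from the question or from PHP.

```php
<?php
// ord() looks at one byte, so a multi-byte character never shows up as a
// single value above 0xFF.
$char = "\u{FFFF}";          // U+FFFF, stored in UTF-8 as the bytes EF BF BF
$firstByte = ord($char[0]);  // 0xEF (239), not 0xFFFF

// Hypothetical replacement for stripInvalidXml(): matches whole code points
// instead of bytes. ASSUMES $value is valid UTF-8.
function stripInvalidXmlChars(string $value, string $replacement = ' '): string
{
    // Negated class of the ranges used in the question's function.
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        $replacement,
        $value
    );
    // preg_replace() returns null on error, e.g. when the input is not UTF-8.
    if ($clean === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $clean;
}

echo stripInvalidXmlChars("251gm-\u{FFFF}50"), PHP_EOL;  // prints "251gm- 50"
```

Pass an empty string as the second argument to drop the character instead of replacing it with a space.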

I'm guessing your XML is invalid because the file contains a character that XML does not allow (indeed &#65535;, i.e. U+FFFF, is a Unicode noncharacter and lies outside the character range XML 1.0 permits, which stops at U+FFFD). This probably comes from copying and pasting between XML files that have different encodings.

I suggest you use the DOM extension (http://php.net/DOM) instead to do your XML mash-up, which handles different encodings automatically by converting them internally to UTF-8.
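As a rough illustration of what that could look like for a sitemap, the sketch below builds a single entry with DOMDocument. The URL is made up, and the strings passed to DOM methods are assumed to be UTF-8 already, since that is what the extension expects. It shows escaping and serialization only; it does not by itself filter out characters that XML disallows.

```php
<?php
// Namespace URI defined by the sitemaps.org protocol.
$ns = 'http://www.sitemaps.org/schemas/sitemap/0.9';

$doc = new DOMDocument('1.0', 'UTF-8');
$urlset = $doc->appendChild($doc->createElementNS($ns, 'urlset'));

// One <url><loc>...</loc></url> entry; the address is a placeholder.
$url = $urlset->appendChild($doc->createElementNS($ns, 'url'));
$loc = $url->appendChild($doc->createElementNS($ns, 'loc'));

// Text nodes are escaped on output, so a raw & is written as &amp;.
$loc->appendChild($doc->createTextNode('http://example.com/item?id=1&ref=2'));

$xml = $doc->saveXML();
echo $xml;
```

Because the markup is generated by the serializer rather than by string concatenation, stray `&` or `<` characters in imported data cannot break well-formedness.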

Answer3:

I think I was looking down the wrong path: rather than an encoding issue, the character was an HTML entity representing the 'squared' symbol. As the descriptions in the URL only exist for search engine purposes, I can safely remove all HTML entities with the following regex:

$content = preg_replace("/&#?[a-z0-9]+;/i","",$content);
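Note that this pattern removes every entity, including harmless ones such as `&amp;`. If that is too aggressive, a narrower variant could drop only the numeric references whose code point is not allowed in XML. The sketch below is one way to do that; `dropInvalidNumericEntities()` is a hypothetical name, and the ranges are the same ones used in the question's function.

```php
<?php
// Hypothetical helper: removes numeric character references (&#NNN; or
// &#xHHH;) that point at code points outside the ranges XML allows, and
// leaves every other entity untouched.
function dropInvalidNumericEntities(string $content): string
{
    return preg_replace_callback(
        '/&#(x[0-9a-f]+|[0-9]+);/i',
        function (array $m): string {
            $ref = $m[1];
            // Hex references start with "x" or "X"; the rest are decimal.
            $cp = ($ref[0] === 'x' || $ref[0] === 'X')
                ? hexdec(substr($ref, 1))
                : (int) $ref;

            $valid = $cp == 0x9 || $cp == 0xA || $cp == 0xD
                || ($cp >= 0x20 && $cp <= 0xD7FF)
                || ($cp >= 0xE000 && $cp <= 0xFFFD)
                || ($cp >= 0x10000 && $cp <= 0x10FFFF);

            // Keep valid references as they are, drop the rest.
            return $valid ? $m[0] : '';
        },
        $content
    );
}

echo dropInvalidNumericEntities('251gm-&#65535;50'), PHP_EOL;  // prints "251gm-50"
```

A valid reference such as `&#178;` (superscript two) is kept by this version, whereas the one-line regex above would remove it.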
