Illegal character in XML


I have a PHP file which produces an XML sitemap based on data which has been imported from a number of sources. My sitemap is currently not well formed due to an illegal character in one line of the imported data, and I am struggling to remove it.

The character looks like it represents the 'squared' sign (superscript 2), and is displayed as a square. I have tried pasting it into a hex editor, but it is shown as a ?, and the hex code also corresponds to ?. I have also tried using iconv to convert from all source encodings to all destination encodings, and no combination removes this character.

I also have the following function to remove non-ascii characters:

function stripInvalidXml($value)
{
    $ret = "";
    $current;
    if (empty($value)) {
        return $ret;
    }
    $length = strlen($value);
    for ($i = 0; $i < $length; $i++) {
        $current = ord($value{$i});
        if (($current == 0x9) ||
            ($current == 0xA) ||
            ($current == 0xD) ||
            (($current >= 0x20) && ($current <= 0xD7FF)) ||
            (($current >= 0xE000) && ($current <= 0xFFFD)) ||
            (($current >= 0x10000) && ($current <= 0x10FFFF))) {
            if ($current != 0x1F) {
                $ret .= chr($current);
            }
        } else {
            $ret .= " ";
        }
    }
    return $ret;
}

However, this still does not remove it. If I step through the code, the illegal character is expanded in Eclipse's debug window. The string it is having issues with is below (hoping it pastes correctly):


Any ideas on a function which will remove this character and prevent this from occurring are much appreciated. I have little control over the data that is imported, so it needs to be done at the point of XML generation.


After posting I can see that the character doesn't appear correctly. When viewed in Eclipse's window it appears as & # 65535 ; (without the spaces; if I type it here without spaces it renders as the character itself, which looks like )


You are trying to perform character transcoding. Don't do it yourself; use the PHP library.

I found iconv quite useful:

$cleanText = iconv('UTF-8','ISO-8859-1//TRANSLIT//IGNORE', $srcText);

This code converts from UTF-8 to ISO-8859-1, trying to remap the 'exotic' characters and ignoring the ones that cannot be transcoded.

I'm just guessing that the source encoding is UTF-8. You have to find out which encoding the incoming data uses and convert it into the one you declare in the XML header.

A Linux command-line tool that guesses a file's encoding is enca.
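As a rough sketch of the "find out the encoding, then convert" step, something like the following could be run on each imported string before it is written out. The helper name toUtf8 and the ISO-8859-1 fallback are assumptions made for illustration; the real fallback depends on what the data sources actually send, and the mbstring extension is assumed to be loaded.

```php
<?php
// Hypothetical helper: normalise imported text to UTF-8 before writing XML.
// $fallbackEncoding is an assumption about the feed, not something detected.
function toUtf8(string $text, string $fallbackEncoding = 'ISO-8859-1'): string
{
    // Bytes that already form valid UTF-8 are passed through untouched.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // Otherwise treat the bytes as the fallback encoding and convert them.
    return mb_convert_encoding($text, 'UTF-8', $fallbackEncoding);
}

$latin1 = "caf\xE9";          // "café" as ISO-8859-1 bytes
$utf8   = toUtf8($latin1);    // same word, now encoded as UTF-8
```

The check is only a heuristic: it cannot tell two single-byte encodings apart, so a tool such as enca, or knowledge of each source, is still needed to choose the fallback.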


This is wrong:

$current = ord($value{$i});
if (($current == 0x9) ||
    ($current == 0xA) ||
    ($current == 0xD) ||
    (($current >= 0x20) && ($current <= 0xD7FF)) ||
    (($current >= 0xE000) && ($current <= 0xFFFD)) ||
    (($current >= 0x10000) && ($current <= 0x10FFFF))) {
    if ($current != 0x1F)
        $ret .= chr($current);
}

ord() never returns anything bigger than 0xFF since it works in a byte-by-byte manner.
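To illustrate the point, here is a minimal sketch, assuming the input is UTF-8, that first shows what ord() actually sees and then applies the same character ranges per code point by using a regular expression with the /u modifier. The function name stripInvalidXmlChars is made up for this example and is not part of the original code.

```php
<?php
// ord() looks at a single byte, so a multi-byte UTF-8 character never
// shows up as its code point:
$squared   = "\u{B2}";          // SUPERSCRIPT TWO, stored as the bytes C2 B2
$firstByte = ord($squared[0]);  // 0xC2, the first byte, not 0xB2

// Hypothetical replacement for stripInvalidXml(): with the /u modifier the
// pattern is matched against code points, so the ranges from the question
// mean what they were intended to mean. Anything outside them becomes a space.
function stripInvalidXmlChars(string $value): string
{
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        ' ',
        $value
    );
    // preg_replace() returns null when the subject is not valid UTF-8.
    if ($clean === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $clean;
}
```

This only deals with literal characters in the string. A character reference that is spelled out as text, such as &#65535;, consists of plain ASCII characters and passes through unchanged.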

I'm guessing your XML is invalid because the file contains a character that XML does not allow (indeed &#65535;, i.e., U+FFFF, is a noncharacter that falls outside the character ranges XML 1.0 permits). This probably comes from copying and pasting from different XML files that have different encodings.

I suggest you use the DOM extension (http://php.net/DOM) instead to do your XML mash-up, which handles different encodings automatically by converting them internally to UTF-8.
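A minimal sketch of building a sitemap with the DOM extension might look like this. The helper name buildSitemap and the example URL are invented for illustration, and the strings passed in are assumed to be UTF-8 already, since that is what the DOM methods expect.

```php
<?php
// Hypothetical helper: build a sitemap document from a list of URLs.
// The input strings are assumed to be UTF-8.
function buildSitemap(array $urls): string
{
    $ns  = 'http://www.sitemaps.org/schemas/sitemap/0.9';
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->formatOutput = true;

    $urlset = $doc->appendChild($doc->createElementNS($ns, 'urlset'));
    foreach ($urls as $address) {
        $url = $urlset->appendChild($doc->createElementNS($ns, 'url'));
        $loc = $url->appendChild($doc->createElementNS($ns, 'loc'));
        // Text nodes are escaped when the document is serialised,
        // so a literal "&" is written out as "&amp;".
        $loc->appendChild($doc->createTextNode($address));
    }
    return $doc->saveXML();
}

$xml = buildSitemap(['http://example.com/?a=1&b=2']);
```

Letting the library serialise the document takes care of escaping markup characters; it is still worth cleaning the imported text beforehand, because this sketch does not show how the extension reacts to characters that XML forbids.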


I think I was looking down the wrong path: rather than an encoding issue, the character was an HTML entity representing the 'squared' symbol. As the descriptions in the URL only exist for search engine purposes, I can safely remove all HTML entities with the following regex:

$content = preg_replace("/&#?[a-z0-9]+;/i","",$content);
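For reference, a small sketch of what that pattern does to a sample string, next to a narrower variant that leaves the five entities predefined by XML (amp, lt, gt, quot, apos) in place. The sample text and the narrower pattern are illustrative assumptions, not part of the original fix.

```php
<?php
// Made-up sample description containing a numeric reference and named entities.
$content = "Area 25m&#65535; &amp; garden&nbsp;view";

// The pattern from the post: removes every entity-like token, including &amp;.
$allRemoved = preg_replace("/&#?[a-z0-9]+;/i", "", $content);

// Hypothetical narrower variant: the negative lookahead skips the five
// entities that XML itself defines, so they stay in the text.
$xmlSafe = preg_replace(
    "/&(?!(?:amp|lt|gt|quot|apos);)#?[a-z0-9]+;/i",
    "",
    $content
);
```

The narrower form may be preferable if the same content ever needs to keep an escaped ampersand, for example inside a URL.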

