Remove with DOMxpath or regex?

I use DOMXPath to remove HTML elements that have no content, while keeping <br/> tags:

    $xpath = new DOMXPath($dom);

    while (($nodeList = $xpath->query('//*[not(text()) and not(node()) and not(self::br)]')) && $nodeList->length > 0) {
        foreach ($nodeList as $node) {
            $node->parentNode->removeChild($node);
        }
    }

It worked perfectly until I ran into another problem:

$content = '<p><br/><br/><br/><br/></p>';

How do I remove this kind of messy <br/> and <p>? In other words, I don't want to allow a <p> that contains nothing but <br/> tags, but I do want to allow <br/> mixed with proper text, like this:

$content = '<p>first break <br/> second break <br/> the last line</p>';

Is that possible?

Or is it better with a regular expression?

I tried something like this,

    $nodeList = $xpath->query("//p[text()=<br\s*\/?>\s*]");
    foreach ($nodeList as $node) {
        $node->parentNode->removeChild($node);
    }

but it returns this error:

Warning: DOMXPath::query() [domxpath.query]: Invalid expression in...


You can select the unwanted p using XPath:

"//p[count(*)=count(br) and br and normalize-space(.)='']"

Note: to select elements with empty text, wouldn't it be better to use the following?

"//*[normalize-space(.)='' and not(self::br)]"

This selects any element (except br) whose text content is empty or whitespace-only, such as:



<p> <br/> <br/> </p>
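
As a rough sketch of how the first expression above could be applied, the snippet below runs it against the two sample strings from the question. The function name `removeBrOnlyParagraphs` and the wrapper `<div>` are assumptions made for illustration, not part of the answer. It also assumes ASCII input, since `loadHTML()` does not treat the input as UTF-8 by default.

```php
<?php
// Hypothetical helper (name chosen for illustration): removes <p> elements
// that contain only <br> elements and whitespace.
function removeBrOnlyParagraphs(string $html): string
{
    $previous = libxml_use_internal_errors(true); // silence parser warnings

    $dom = new DOMDocument();
    // Assumption: wrap the fragment in a <div> so it can be extracted again,
    // because loadHTML() adds <html> and <body> around fragments.
    $dom->loadHTML('<div id="wrapper">' . $html . '</div>');

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query("//p[count(*)=count(br) and br and normalize-space(.)='']");

    // Copy the node list to an array before removing nodes from the tree.
    foreach (iterator_to_array($nodes) as $node) {
        $node->parentNode->removeChild($node);
    }

    // The wrapper is the first <div> in document order.
    $wrapper = $dom->getElementsByTagName('div')->item(0);
    $result = '';
    foreach ($wrapper->childNodes as $child) {
        $result .= $dom->saveHTML($child);
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $result;
}

// The paragraph holding only <br/> tags is removed entirely.
var_dump(removeBrOnlyParagraphs('<p><br/><br/><br/><br/></p>'));

// The paragraph with real text is kept.
echo removeBrOnlyParagraphs('<p>first break <br/> second break <br/> the last line</p>'), "\n";
```

Note that `saveHTML()` serializes the result as HTML, so the kept `<br/>` tags may come back written in a different but equivalent form.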



I have almost the same situation. I use:

$document->loadHTML(str_replace('<br>', urlencode('<br>'), $string_or_file));

Then use urldecode() to change it back for display or before inserting into the database. It works for me.


You could get rid of them all by simply checking that the only things within a paragraph are whitespace and <br /> tags:

    preg_replace("/\<p\>(\s|\<br\s*\/\>)*\<\/p\>/", "", $content);

Broken down:

    \<p\>            # Match <p>
    (                # Beginning of a group
      \s             # Match a whitespace character
      |              # or...
      \<br\s*\/\>    # Match a <br /> tag, with any number (including 0) of spaces between <br and />
    )*               # Match this whole group (spaces or <br /> tags) 0 or more times
    \<\/p\>          # Match </p>

I will mention, however, that unless your HTML is well-formatted (one-line, no strange spaces or paragraph classes, etc), you should not use regex to parse this. If it is, this regex should work just fine.
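
A minimal sketch of the regex approach follows, under the same assumption that the markup is simple (`<p>` without attributes). The helper name `stripBrOnlyParagraphsRegex` is hypothetical. The `~` delimiters, the optional slash (`/?`) and the `i` flag are additions not in the pattern above, made so that `<br>` and uppercase tags are also accepted.

```php
<?php
// Hypothetical helper (name chosen for illustration): removes paragraphs
// that contain only whitespace and <br> tags.
function stripBrOnlyParagraphsRegex(string $content): string
{
    // PCRE patterns in PHP need delimiters; '~' avoids escaping '/'.
    $pattern = '~<p>(?:\s|<br\s*/?>)*</p>~i';

    // preg_replace() returns null on error; fall back to the original input.
    return preg_replace($pattern, '', $content) ?? $content;
}

// The paragraph holding only <br/> tags is removed entirely.
var_dump(stripBrOnlyParagraphsRegex('<p><br/><br/><br/><br/></p>'));

// The paragraph with real text does not match and is returned unchanged.
echo stripBrOnlyParagraphsRegex('<p>first break <br/> second break <br/> the last line</p>'), "\n";
```

Because the group may repeat zero times, this pattern also removes completely empty paragraphs such as `<p></p>`.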

