Remove with DOMXPath or regex?

I use DOMXPath to remove HTML elements that have no text node, while keeping <br/> tags:

$xpath = new DOMXPath($dom);
while (($nodeList = $xpath->query('//*[not(text()) and not(node()) and not(self::br)]')) && $nodeList->length > 0) {
    foreach ($nodeList as $node) {
        $node->parentNode->removeChild($node);
    }
}

It worked perfectly until I came across another problem:

$content = '<p><br/><br/><br/><br/></p>';

How do I remove this kind of messy <br/> and <p>? That is, I don't want to allow a <p> that contains only <br/> tags, but I do allow <br/> mixed with proper text, like this:

$content = '<p>first break <br/> second break <br/> the last line</p>';

Is that possible?

Or is it better with a regular expression?

I tried something like this:

$nodeList = $xpath->query("//p[text()=<br\s*\/?>\s*]");
foreach ($nodeList as $node) {
    $node->parentNode->removeChild($node);
}

but it returns this error:

Warning: DOMXPath::query() [domxpath.query]: Invalid expression in...

Answer1:

You can select the unwanted p using XPath:

"//p[count(*)=count(br) and br and normalize-space(.)='']"
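
As a minimal sketch of how this expression might be applied, assuming the content is an HTML fragment loaded with DOMDocument::loadHTML() (which wraps it in html and body elements), something like the following could work. The function name removeBrOnlyParagraphs is hypothetical and only used for illustration; it is not part of the original question or answer.

```php
<?php
// Hypothetical helper (name is an assumption): removes <p> elements whose
// only element children are <br> and whose text content is whitespace-only.
function removeBrOnlyParagraphs(string $html): string
{
    if (trim($html) === '') {
        return '';
    }

    // Collect libxml parser warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $query = "//p[count(*)=count(br) and br and normalize-space(.)='']";
    foreach ($xpath->query($query) as $p) {
        $p->parentNode->removeChild($p);
    }

    // Serialize only the children of <body>, i.e. the original fragment.
    $body = $dom->getElementsByTagName('body')->item(0);
    $out = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $dom->saveHTML($child);
        }
    }
    return $out;
}

$content = '<p><br/><br/><br/><br/></p>'
         . '<p>first break <br/> second break <br/> the last line</p>';
echo removeBrOnlyParagraphs($content), PHP_EOL;
```

Only the first paragraph should be dropped here; the second one contains real text, so normalize-space(.) is not empty and the predicate fails for it.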

Note: to select elements with no text, wouldn't it be better to use the following?

"//*[normalize-space(.)='' and not(self::br)]"

This will select any element (other than br) that contains no non-whitespace text, nodes like:

<p><b/><i/></p>

or

<p> <br/> <br/> </p>

included.
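
A rough sketch of that second expression in use, under the assumption that the fragment is loaded with DOMDocument::loadHTML(). The helper name removeTextlessElements is hypothetical, and the query is scoped to //body//* (my assumption, not part of the answer) so that the html, head and body wrappers are never matched. Because normalize-space(.) looks at the whole subtree, a single pass should be enough here. Be aware that elements such as img or hr also have no text and would be removed as well.

```php
<?php
// Hypothetical helper (name is an assumption): removes every element under
// <body>, except <br>, whose subtree contains no non-whitespace text.
function removeTextlessElements(string $html): string
{
    if (trim($html) === '') {
        return '';
    }

    // Collect libxml parser warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Scoped to descendants of <body>; the predicate itself is from the answer.
    $query = "//body//*[normalize-space(.)='' and not(self::br)]";
    foreach ($xpath->query($query) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Serialize only the children of <body>, i.e. the original fragment.
    $body = $dom->getElementsByTagName('body')->item(0);
    $out = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $dom->saveHTML($child);
        }
    }
    return $out;
}

$content = '<p><b></b><i></i></p><p> <br/> <br/> </p><p>kept <br/> text</p>';
echo removeTextlessElements($content), PHP_EOL;
```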

Answer2:

I had almost the same situation. I use:

$document->loadHTML(str_replace('<br>', urlencode('<br>'), $string_or_file));

Then I use urldecode() to change it back for display or before inserting into the database. It works for me.

Answer3:

You could get rid of them all by simply checking that the only things within a paragraph are whitespace and <br /> tags:

$content = preg_replace('/\<p\>(\s|\<br\s*\/\>)*\<\/p\>/', '', $content);

Broken down:

\<p\>          # Match <p>
(              # Beginning of a group
  \s           # Match a whitespace character
  |            # or...
  \<br\s*\/\>  # Match a <br /> tag, with any number (including 0) of spaces between <br and />
)*             # Match this whole group (whitespace or <br /> tags) 0 or more times
\<\/p\>        # Match </p>

I will mention, however, that unless your HTML is well-formatted (one line, no strange spaces, no paragraph classes, etc.), you should not use regex to parse it. If it is well-formatted, this regex should work just fine.
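
If the markup is a little less uniform, a slightly more tolerant variant might look like the sketch below. This is my own variation rather than the pattern above: the function name stripBrOnlyParagraphsRegex is hypothetical, and it assumes you also want to accept <br>, <br/> and <br />, attributes on <p>, and upper-case tags. It still does not handle things like &nbsp; or nested markup.

```php
<?php
// Hypothetical helper (name is an assumption): removes paragraphs that
// contain nothing but whitespace and <br> tags, using a regular expression.
function stripBrOnlyParagraphsRegex(string $html): string
{
    // "~" is the delimiter, so "/" needs no escaping inside the pattern.
    //   <p\b[^>]*>          opening <p>, optionally with attributes (not <pre>)
    //   (?:\s|<br\s*/?>)*   any run of whitespace or <br>, <br/>, <br />
    //   </p>                closing tag
    //   i                   case-insensitive matching
    $pattern = '~<p\b[^>]*>(?:\s|<br\s*/?>)*</p>~i';

    // preg_replace() returns null on error; fall back to the input then.
    return preg_replace($pattern, '', $html) ?? $html;
}

$messy = '<p><br/><br/><br/><br/></p>';
$fine  = '<p>first break <br/> second break <br/> the last line</p>';

var_dump(stripBrOnlyParagraphsRegex($messy) === '');    // bool(true)
var_dump(stripBrOnlyParagraphsRegex($fine) === $fine);  // bool(true)
```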
