how to scrape web page data without losing tags

Question:

I am trying to scrape web data using PHP and DOM XPath. When I store $node->nodeValue in my database, or even just echo it, all the tags such as <p> and <br> are missing, so all the paragraphs come out concatenated. How can I solve this problem?

Answer1:

If you have a node, and you need all its contents as they are, you can use this function:

function innerHTML(DOMNode $node)
{
    // Copy every child of $node into a fresh document, then serialize it.
    $doc = new DOMDocument();
    foreach ($node->childNodes as $child) {
        $doc->appendChild($doc->importNode($child, true));
    }
    return $doc->saveHTML();
}
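As a rough illustration of the difference between nodeValue and serialized markup, here is a minimal sketch. The sample markup, the `content` id and the variable names are made up for the example. It uses a variant of the same idea: passing each child node to saveHTML() directly, which requires PHP 5.3.6 or later.

```php
<?php
// Hypothetical sample markup standing in for a scraped page.
$html = '<div id="content"><p>First paragraph.</p><p>Second<br>line.</p></div>';

$doc = new DOMDocument();
// Real-world HTML is often non-conforming; collect parser warnings quietly.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$node = $xpath->query('//div[@id="content"]')->item(0);

// nodeValue holds only the text content, so the markup is gone.
$text = $node->nodeValue;

// Serializing each child node keeps the tags.
$inner = '';
foreach ($node->childNodes as $child) {
    $inner .= $doc->saveHTML($child);
}

echo $text, "\n";
echo $inner, "\n";
```

Storing $inner rather than $text in the database should keep the paragraph and line-break tags, although they are re-serialized by the parser and may not match the source byte for byte.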

Answer2:

Once the page has been parsed into a DOM, the tags no longer exist as text. They have become nodes in the tree, and the only thing you can read in "string form" is the raw content contained in those nodes. You can, of course, use node information to reconstruct the tags, but they won't be the original tags (e.g., you will have to choose <BR> or <br>; you won't know which the site originally had). If you want the original tags from the start, keep the original stream of bytes returned by the GET/POST you did; don't parse it into a DOM tree.
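The point about losing the original tags can be sketched as follows. The $raw string is a hypothetical stand-in for the response body; in a real scraper it would come from whatever HTTP call you use (for example file_get_contents() or cURL). The sketch keeps the untouched bytes in one variable and compares them with what the DOM gives back.

```php
<?php
// Hypothetical raw response body with upper-case tags.
$raw = '<html><body><P>First line<BR>second line</P></body></html>';

// Keep the untouched bytes if the original markup matters.
$original = $raw;

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($raw);
libxml_clear_errors();

// Tag names are normalized by the parser, so look the element up in lower case.
$p = $doc->getElementsByTagName('p')->item(0);

// Markup rebuilt from the tree; the original casing is not recoverable here.
$rebuilt = $doc->saveHTML($p);

echo $original, "\n";
echo $rebuilt, "\n";
```

If exact fidelity to the site's markup is required, $original is the value to store; $rebuilt is only a reconstruction.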
