
Help with Java Swing HTML parsing

Question:

I am parsing a collection of HTML documents with the Java Swing HTML parsing libraries, and I want to isolate the text between the <title> tags so that I can use it to identify each document. I am having a hard time doing that because the handleStartTag method does not have access to the text inside the tags.

Answer1:

You can use XPath to pull out data from HTML:

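    import java.io.StringReader;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMResult;
    import javax.xml.transform.stream.StreamSource;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Node;

    String html = //...

    //read the HTML into a DOM (an identity transform copies the parsed source into the DOMResult)
    StreamSource source = new StreamSource(new StringReader(html));
    DOMResult result = new DOMResult();
    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    transformer.transform(source, result);
    Node root = result.getNode();

    //use XPath to get the title; <title> is a child of <head>, not of <html>
    //(this unprefixed path assumes the document declares no default XML namespace)
    XPath xpath = XPathFactory.newInstance().newXPath();
    String title = xpath.evaluate("/html/head/title", root);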

However, the input must be well-formed XHTML for this to work, because it is parsed as XML. For example, "<br>" is valid in HTML but is not well-formed in XHTML because it is never closed. It must be written as "<br />" to be valid in XHTML.
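
If the documents are ordinary HTML rather than well-formed XHTML, it may be simpler to stay with the Swing parser from the question. The sketch below assumes the parser reports the title text through handleText, so a flag set in handleStartTag and cleared in handleEndTag is enough to capture it. The names TitleExtractor, TitleCallback and extractTitle are made up for this example and are not part of any library.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class TitleExtractor {

    /** Hypothetical callback that collects the text between <title> and </title>. */
    static class TitleCallback extends HTMLEditorKit.ParserCallback {
        private final StringBuilder title = new StringBuilder();
        private boolean inTitle = false;

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            // Remember that we are inside <title>; the text itself arrives in handleText.
            if (tag == HTML.Tag.TITLE) {
                inTitle = true;
            }
        }

        @Override
        public void handleText(char[] data, int pos) {
            // Only keep text that was reported while the <title> element was open.
            if (inTitle) {
                title.append(data);
            }
        }

        @Override
        public void handleEndTag(HTML.Tag tag, int pos) {
            if (tag == HTML.Tag.TITLE) {
                inTitle = false;
            }
        }

        String getTitle() {
            return title.toString().trim();
        }
    }

    /** Parses the HTML from the reader and returns the title text (empty if none was found). */
    static String extractTitle(Reader html) throws IOException {
        TitleCallback callback = new TitleCallback();
        // The last argument tells the parser to ignore any charset declared in the document.
        new ParserDelegator().parse(html, callback, true);
        return callback.getTitle();
    }

    public static void main(String[] args) throws IOException {
        // Note the unclosed <br>: this parser does not require XHTML.
        String html = "<html><head><title>My Document</title></head>"
                + "<body><p>Line one<br>Line two</p></body></html>";
        System.out.println(extractTitle(new StringReader(html)));
    }
}
```

To process a collection of files, extractTitle could be called once per document with a FileReader (or another Reader) in place of the StringReader.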
