I am parsing a collection of HTML documents with the Java Swing HTML parsing libraries, and I want to isolate the text between
the <title> tags so that I can use it to identify each document. I am having a hard time doing that because the
handleStartTag method doesn't have access to the text inside the tags.
You can use XPath to pull out data from HTML:
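    import java.io.StringReader;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMResult;
    import javax.xml.transform.stream.StreamSource;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Node;

    String html = //...

    //read the HTML into a DOM
    StreamSource source = new StreamSource(new StringReader(html));
    DOMResult result = new DOMResult();
    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    transformer.transform(source, result);
    Node root = result.getNode();

    //use XPath to get the title (the <title> element sits inside <head>)
    XPath xpath = XPathFactory.newInstance().newXPath();
    String title = xpath.evaluate("/html/head/title", root);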
However, the input must be well-formed XHTML for this to work. For example, "<br>" is valid in HTML but not in XHTML, because the tag is never closed. It has to be written as "<br />" to be valid XHTML.
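To make the well-formedness caveat concrete, here is a minimal sketch that wraps the approach above in a helper and runs it on two inputs. The class name TitleExtractor, the method extractTitle, and both sample strings are made up for illustration. Returning null when the parse fails is just one possible way to handle bad input, not something the XPath API requires.

```java
import java.io.StringReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stream.StreamSource;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;

public class TitleExtractor {

    // Hypothetical helper: returns the <title> text, or null when the
    // markup cannot be parsed as XML (i.e. it is not well formed).
    static String extractTitle(String html) throws XPathExpressionException {
        Node root;
        try {
            // Identity transform: copies the parsed input into a DOM tree.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            DOMResult result = new DOMResult();
            transformer.transform(new StreamSource(new StringReader(html)), result);
            root = result.getNode();
        } catch (TransformerException e) {
            // Parse failures (such as an unclosed <br>) surface here.
            return null;
        }
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate("/html/head/title", root);
    }

    public static void main(String[] args) throws Exception {
        // Sample inputs invented for this sketch; neither declares a DOCTYPE or namespace.
        String wellFormed =
            "<html><head><title>Doc A</title></head><body><p>one<br />two</p></body></html>";
        String notWellFormed =
            "<html><head><title>Doc B</title></head><body><p>one<br>two</p></body></html>";

        System.out.println(extractTitle(wellFormed));    // expected: Doc A
        System.out.println(extractTitle(notWellFormed)); // expected: null
    }
}
```

The parser may also log the parse error to standard error for the second input. Note that the sample markup has no xmlns attribute: if your documents declare the XHTML namespace, the plain path "/html/head/title" will not match, and you would need a NamespaceContext or a local-name() test instead.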