I've got thousands of files to process. Each file consists of thousands of XML documents concatenated together.
I'd like to use Hadoop to split out each XML document separately. What would be a good way of doing this with Hadoop?
<strong>NOTES:</strong> I am a total Hadoop newbie. I plan on using Amazon EMR.

Answer1:
Check out <a href="http://people.apache.org/~isabel/mahout_site/mahout-examples/apidocs/org/apache/mahout/classifier/bayes/XmlInputFormat.html" rel="nofollow">Mahout's XmlInputFormat</a>. It's a shame that this is in Mahout and not in the core distribution.
Are the concatenated XML documents at least in the same format? If so, set START_TAG_KEY and END_TAG_KEY to the root tag of each document. Each document will then show up as one Text record in the map, and you can use your favorite Java XML parser to finish the job.
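To make the record-splitting idea concrete, here is a minimal, Hadoop-free sketch of the logic that a tag-delimited input format applies: scan the byte stream for the configured start tag, read through to the matching end tag, and emit everything in between (tags included) as one record. The `&lt;doc&gt;` tag name is a hypothetical placeholder for your actual root element, and unlike the real XmlInputFormat this sketch works on an in-memory string and does not handle nested occurrences of the same tag.

```java
import java.util.ArrayList;
import java.util.List;

public class XmlSplitSketch {

    // Return each start..end tag span (inclusive) as one record,
    // mirroring how a tag-delimited input format carves up a file.
    public static List<String> split(String concatenated, String startTag, String endTag) {
        List<String> records = new ArrayList<>();
        int pos = 0;
        while (true) {
            int start = concatenated.indexOf(startTag, pos);
            if (start < 0) break;                       // no more documents
            int end = concatenated.indexOf(endTag, start);
            if (end < 0) break;                         // truncated trailing document: skip it
            end += endTag.length();
            records.add(concatenated.substring(start, end));
            pos = end;                                  // resume scanning after this record
        }
        return records;
    }

    public static void main(String[] args) {
        // Two XML documents concatenated in one "file".
        String data = "<doc><a>1</a></doc><doc><a>2</a></doc>";
        for (String record : split(data, "<doc>", "</doc>")) {
            System.out.println(record);
        }
    }
}
```

Each string returned by `split` corresponds to what your map function would receive as one Text value, ready to hand to an XML parser.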