Hadoop job to split XML files

Question:

I've got thousands of files to process. Each file consists of thousands of XML documents concatenated together.

I'd like to use Hadoop to split the concatenated files so that each XML document can be handled separately. What would be a good way of doing this using Hadoop?

<strong>Note:</strong> I am a total Hadoop newbie. I plan on using Amazon EMR.

Answer1:

Check out <a href="http://people.apache.org/~isabel/mahout_site/mahout-examples/apidocs/org/apache/mahout/classifier/bayes/XmlInputFormat.html" rel="nofollow">Mahout's XmlInputFormat</a>. It's a shame that this is in Mahout and not in the core distribution.

Do the concatenated XML documents at least share the same format? If so, set START_TAG_KEY and END_TAG_KEY to the opening and closing tags of the root element used in your files. Each XML document will then show up as one Text record in the map, and you can use your favorite Java XML parser to finish the job.
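To make the start/end tag idea concrete, here is a minimal plain-Java sketch of the kind of scanning such an input format performs: find the start tag, read up to the matching end tag, and emit that span as one record. It is not Mahout's implementation and does not use Hadoop; the class name `XmlSplitSketch`, the method `splitRecords`, and the `<doc>` root tag are all made up for illustration. It also assumes the start tag appears literally (no attributes) and that the root element is not nested inside itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: mimics the start-tag/end-tag scanning idea
// in memory, without Hadoop or Mahout on the classpath.
public class XmlSplitSketch {

    /** Returns every span that begins at startTag and ends at the next endTag. */
    public static List<String> splitRecords(String input, String startTag, String endTag) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = input.indexOf(startTag, from);
            if (start < 0) {
                break; // no further records
            }
            int end = input.indexOf(endTag, start + startTag.length());
            if (end < 0) {
                break; // unterminated record: drop it
            }
            end += endTag.length();
            records.add(input.substring(start, end));
            from = end; // continue scanning after this record
        }
        return records;
    }

    public static void main(String[] args) {
        // Two XML documents concatenated into one input, as in the question.
        String concatenated =
              "<?xml version=\"1.0\"?><doc><id>1</id></doc>\n"
            + "<?xml version=\"1.0\"?><doc><id>2</id></doc>\n";

        // "<doc>" and "</doc>" play the role of the start and end tag settings.
        for (String record : splitRecords(concatenated, "<doc>", "</doc>")) {
            System.out.println(record);
        }
    }
}
```

In a real job the input format does this scanning across input splits and hands each record to the mapper as a Text value; the mapper would then pass the record's string to an XML parser. Note that the XML declaration between documents falls outside the tags and is simply skipped.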
