How to re-run the whole map/reduce in Hadoop before job completion?

Question:

I am using Hadoop Map/Reduce with Java.

Suppose I have completed a whole map/reduce pass. Is there any way I could repeat only the map/reduce part, without ending the job? I mean, I DON'T want to chain different jobs together; I only want the map/reduce part to repeat.

Thank you!

Answer1:

I am more familiar with the Hadoop streaming APIs, but the approach should translate to the native APIs.

As I understand it, you are trying to run several iterations of the same map() and reduce() operations on the input data.

Let's say your initial map() input comes from the file input.txt and the output file is output{iteration}.txt, where iteration is the loop counter (0 <= iteration < number of iterations). In the next invocation of map()/reduce(), the input file is output{iteration}.txt from the previous pass and the output file becomes output{iteration+1}.txt.
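
To make the naming scheme concrete, here is a minimal sketch in plain Java (no Hadoop dependency) that only computes which path each iteration reads from and writes to. The class name `IterationPaths`, the `Round` holder, and the `plan` method are hypothetical names chosen for illustration. The `.txt` suffix on outputs is dropped because Hadoop output paths are directories rather than single files.

```java
import java.util.ArrayList;
import java.util.List;

public class IterationPaths {

    /** One map/reduce round: read from input, write to output. */
    public static final class Round {
        public final String input;
        public final String output;

        Round(String input, String output) {
            this.input = input;
            this.output = output;
        }

        @Override
        public String toString() {
            return input + " -> " + output;
        }
    }

    /**
     * Builds the input/output pair for every iteration.
     * Iteration 0 reads the initial input; each later iteration
     * reads what the previous one wrote.
     */
    public static List<Round> plan(String initialInput, String outputPrefix, int iterations) {
        List<Round> rounds = new ArrayList<>();
        String in = initialInput;
        for (int i = 0; i < iterations; i++) {
            String out = outputPrefix + i;
            rounds.add(new Round(in, out));
            in = out; // the output of this round feeds the next one
        }
        return rounds;
    }

    public static void main(String[] args) {
        for (Round r : plan("input.txt", "output", 3)) {
            System.out.println(r);
        }
    }
}
```

With three iterations this prints `input.txt -> output0`, `output0 -> output1`, and `output1 -> output2`, one per line. In a real driver, each pair would be handed to a separately submitted job.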

Let me know if this is not clear; I can put together a quick example and post a link here.

EDIT: For Java, I modified the Hadoop WordCount example to run multiple times:

    package com.rorlig;

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountJob {

        // Emits (token, 1) for every whitespace-separated token in a line.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Sums the counts for each key; also used as the combiner.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            if (args.length != 3) {
                System.err.println("Usage: wordcount <in> <out> <iterations>");
                System.exit(2);
            }
            int iterations = Integer.parseInt(args[2]);
            Path inPath = new Path(args[0]);
            Path outPath = null;
            for (int i = 0; i < iterations; ++i) {
                // Each iteration writes to its own directory: <out>0, <out>1, ...
                outPath = new Path(args[1] + i);
                // Job.getInstance replaces the deprecated new Job(conf, name) in Hadoop 2+.
                Job job = Job.getInstance(conf, "word count");
                job.setJarByClass(WordCountJob.class);
                job.setMapperClass(TokenizerMapper.class);
                job.setCombinerClass(IntSumReducer.class);
                job.setReducerClass(IntSumReducer.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);
                FileInputFormat.addInputPath(job, inPath);
                FileOutputFormat.setOutputPath(job, outPath);
                // Stop if an iteration fails, since the next one depends on its output.
                if (!job.waitForCompletion(true)) {
                    System.exit(1);
                }
                // The output of this iteration becomes the input of the next.
                inPath = outPath;
            }
        }
    }

Hope this helps
