Input:
The input is the set of InputSplits generated by the job's InputFormat.
The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.
Job.setMapperClass(Class):
Sets the Mapper; the application must provide a map function.
Passes the Mapper implementation to the Job;
the framework then calls map(WritableComparable, Writable, Context) for each key/value pair in the InputSplit.
( hint: Applications can then override the cleanup(Context) method to perform any required cleanup. )
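A minimal sketch of such a Mapper, assuming a WordCount-style tokenizing job (the class name TokenizerMapper and the emitted <word, 1> pairs are illustrative, not from the original notes):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    // Called once for each key/value pair in the InputSplit.
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }

    // Called once at the end of the task; override to perform any required cleanup.
    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // e.g. release resources opened in setup()
    }
}
```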
Job.setGroupingComparatorClass(Class)
Sets the Comparator used for grouping: it controls which map output keys are grouped together for a single call to reduce (see the sketch after the link below).
(e.g. https://www.cnblogs.com/xuxm2007/archive/2011/09/03/2165805.html)
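A minimal sketch of a grouping comparator, assuming a hypothetical composite Text key of the form "naturalKey#secondaryField", as in the secondary-sort pattern the linked post describes:

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Groups Text keys by the part before the '#' separator, so keys like
// "user1#2019-01-01" and "user1#2019-01-02" reach reduce() in a single call.
// The "naturalKey#secondaryField" layout is an assumption for illustration.
public class NaturalKeyGroupingComparator extends WritableComparator {

    protected NaturalKeyGroupingComparator() {
        super(Text.class, true); // true => create key instances for deserialization
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String left = ((Text) a).toString().split("#", 2)[0];
        String right = ((Text) b).toString().split("#", 2)[0];
        return left.compareTo(right);
    }
}

// Registered via: job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);
```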
Job.setCombinerClass(Class)
Sets the Combiner, which performs local aggregation of the intermediate map outputs to cut the volume of data shuffled from the maps to the reduces; see the driver sketch below.
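Wiring the pieces into a Job might look like the sketch below, reusing the TokenizerMapper above and an IntSumReducer like the one sketched in the Reducer section; reusing the Reducer as the Combiner is safe here only because summation is commutative and associative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        // The Combiner runs locally on each map's output, cutting shuffle traffic.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```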
The number of maps is usually driven by the total number of blocks of the input files.
Running roughly 10-100 map tasks per node is a reasonable level of parallelism.
Reducer
Job.setNumReduceTasks(int)
Sets the number of reduce tasks.
Job.setReducerClass(Class)
reduce(WritableComparable, Iterable<Writable>, Context)
Analogous to map: the framework calls it once for each <key, (collection of values)> pair in the grouped inputs.
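A minimal Reducer sketch matching the WordCount-style Mapper above (names are illustrative):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    // Called once per <key, (collection of values)> pair in the grouped inputs.
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
```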
Shuffle & Sort: the map outputs are sorted by key and then partitioned among the reduces.
Job.setSortComparatorClass(Class)
Sets the Comparator that controls how intermediate keys are ordered when the outputs of multiple maps are merged before the reduce; a sketch follows.
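A sketch of a sort comparator that reverses the natural order of Text keys (a common illustrative use; the class name is made up):

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Sorts intermediate Text keys in descending order during the merge of
// map outputs; purely illustrative.
public class DescendingTextComparator extends WritableComparator {

    protected DescendingTextComparator() {
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return ((Text) b).compareTo((Text) a); // reverse of the natural order
    }
}

// Registered via: job.setSortComparatorClass(DescendingTextComparator.class);
```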
The output of the Reducer is not sorted.
The right number of reduces: 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>).
( 0.95: all of the reduces can launch immediately and start transferring map outputs as soon as the maps finish.
1.75: the faster nodes finish a first round of reduces and launch a second wave, keeping data transfer and reduce processing better balanced (this increases framework overhead, but maintains the transfer/processing balance and lowers the extra cost of failures).
* Keeping the factor slightly below a whole number leaves a few slots free for failed tasks. )
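As a worked example, a 10-node cluster with 8 containers per node gives 0.95 * 10 * 8 = 76 reduces; set inside a driver like the sketch above (the cluster figures are hypothetical):

```java
// Inside a driver like the WordCountDriver sketch; figures are placeholders.
int nodes = 10;
int maxContainersPerNode = 8;
// 0.95 * 10 * 8 = 76, leaving a few container slots free for failed tasks.
job.setNumReduceTasks((int) (0.95 * nodes * maxContainersPerNode));
```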
If the number of reduce tasks is set to 0, the map outputs are written directly, without sorting, to the given location in the file system (sketched below).
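A map-only job is then configured roughly as follows (the output path is a placeholder):

```java
// Map-only job: map outputs go straight to the output path, unsorted.
job.setNumReduceTasks(0);
FileOutputFormat.setOutputPath(job, new Path("/tmp/map-only-output")); // hypothetical path
```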
The Partitioner decides how the map outputs are partitioned among the reduces; the default is HashPartitioner. A sketch follows.
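A sketch of a custom Partitioner that mimics what the default HashPartitioner does (the class name is illustrative):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each key to a reduce partition; this reproduces the default
// HashPartitioner behaviour (hash of the key modulo the number of reduces).
public class WordPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask off the sign bit so the result is non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// Registered via: job.setPartitionerClass(WordPartitioner.class);
```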
Counters can be used to report statistics from the map/reduce application; see the sketch below.
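A sketch of reporting a Counter from inside a Mapper; the enum and the malformed-record check are hypothetical:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Counts malformed input lines via a Counter; the enum name is made up.
public class ValidatingMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    public enum RecordQuality { MALFORMED, WELL_FORMED }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().split(",").length < 2) {
            // Counter values are aggregated by the framework and shown in the job stats.
            context.getCounter(RecordQuality.MALFORMED).increment(1);
            return;
        }
        context.getCounter(RecordQuality.WELL_FORMED).increment(1);
        context.write(value, NullWritable.get());
    }
}
```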
Reposted from: https://www.cnblogs.com/gonens/p/10712490.html