Input:
The input is the set of InputSplits generated by the job's InputFormat.
The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.
Job.setMapperClass(Class):
Sets the Mapper; the application must provide a map function.
Passes the Mapper implementation to the Job;
the framework then calls map(WritableComparable, Writable, Context) for each key/value pair in the InputSplit.
( hint: Applications can then override the cleanup(Context) method to perform any required cleanup. )
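A minimal sketch of such a Mapper, assuming a WordCount-style tokenizing job (the class name TokenizerMapper and the emitted <word, 1> pairs are illustrative, not from the original notes):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    // Called once for each key/value pair in the InputSplit.
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }

    // Called once at the end of the task; override to perform any required cleanup.
    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // e.g. release resources opened in setup()
    }
}
```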
Job.setGroupingComparatorClass(Class)
Sets the Comparator used for grouping: it controls which map output keys are grouped together for a single call to reduce (see the sketch after the link below).
(e.g. https://www.cnblogs.com/xuxm2007/archive/2011/09/03/2165805.html)
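A minimal sketch of a grouping comparator, assuming a hypothetical composite Text key of the form "naturalKey#secondaryField", as in the secondary-sort pattern the linked post describes:

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Groups Text keys by the part before the '#' separator, so keys like
// "user1#2019-01-01" and "user1#2019-01-02" reach reduce() in a single call.
// The "naturalKey#secondaryField" layout is an assumption for illustration.
public class NaturalKeyGroupingComparator extends WritableComparator {

    protected NaturalKeyGroupingComparator() {
        super(Text.class, true); // true => create key instances for deserialization
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String left = ((Text) a).toString().split("#", 2)[0];
        String right = ((Text) b).toString().split("#", 2)[0];
        return left.compareTo(right);
    }
}

// Registered via: job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);
```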
Job.setCombinerClass(Class)
Sets the Combiner, which performs local aggregation of the intermediate map outputs to cut the volume of data shuffled from the maps to the reduces; see the driver sketch below.
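Wiring the pieces into a Job might look like the sketch below, reusing the TokenizerMapper above and an IntSumReducer like the one sketched in the Reducer section; reusing the Reducer as the Combiner is safe here only because summation is commutative and associative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        // The Combiner runs locally on each map's output, cutting shuffle traffic.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```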
The number of maps is usually driven by the total number of blocks of the input files.
Running roughly 10-100 map tasks per node is a reasonable level of parallelism.
Reducer
Job.setNumReduceTasks(int)
Sets the number of reduce tasks.
Job.setReducerClass(Class)
reduce(WritableComparable, Iterable<Writable>, Context)
Analogous to map: the framework calls it once for each <key, (collection of values)> pair in the grouped inputs.
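A minimal Reducer sketch matching the WordCount-style Mapper above (names are illustrative):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    // Called once per <key, (collection of values)> pair in the grouped inputs.
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
```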
Shuffle & Sort: the map outputs are sorted by key and then partitioned among the reduces.
Job.setSortComparatorClass(Class)
Sets the Comparator that controls how intermediate keys are ordered when the outputs of multiple maps are merged before the reduce; a sketch follows.
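A sketch of a sort comparator that reverses the natural order of Text keys (a common illustrative use; the class name is made up):

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Sorts intermediate Text keys in descending order during the merge of
// map outputs; purely illustrative.
public class DescendingTextComparator extends WritableComparator {

    protected DescendingTextComparator() {
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return ((Text) b).compareTo((Text) a); // reverse of the natural order
    }
}

// Registered via: job.setSortComparatorClass(DescendingTextComparator.class);
```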
The output of the Reducer is not sorted.
The right number of reduces: 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>).
( 0.95: all of the reduces can launch immediately and start transferring map outputs as soon as the maps finish.
1.75: the faster nodes finish a first round of reduces and launch a second wave, keeping data transfer and reduce processing better balanced (this increases framework overhead, but maintains the transfer/processing balance and lowers the extra cost of failures).
* Keeping the factor slightly below a whole number leaves a few slots free for failed tasks. )
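As a worked example, a 10-node cluster with 8 containers per node gives 0.95 * 10 * 8 = 76 reduces; set inside a driver like the sketch above (the cluster figures are hypothetical):

```java
// Inside a driver like the WordCountDriver sketch; figures are placeholders.
int nodes = 10;
int maxContainersPerNode = 8;
// 0.95 * 10 * 8 = 76, leaving a few container slots free for failed tasks.
job.setNumReduceTasks((int) (0.95 * nodes * maxContainersPerNode));
```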
If the number of reduce tasks is set to 0, the map outputs are written directly, without sorting, to the given location in the file system (sketched below).
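A map-only job is then configured roughly as follows (the output path is a placeholder):

```java
// Map-only job: map outputs go straight to the output path, unsorted.
job.setNumReduceTasks(0);
FileOutputFormat.setOutputPath(job, new Path("/tmp/map-only-output")); // hypothetical path
```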
The Partitioner decides how the map outputs are partitioned among the reduces; the default is HashPartitioner. A sketch follows.
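A sketch of a custom Partitioner that mimics what the default HashPartitioner does (the class name is illustrative):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each key to a reduce partition; this reproduces the default
// HashPartitioner behaviour (hash of the key modulo the number of reduces).
public class WordPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask off the sign bit so the result is non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// Registered via: job.setPartitionerClass(WordPartitioner.class);
```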
Counters can be used to report statistics from the map/reduce application; see the sketch below.
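A sketch of reporting a Counter from inside a Mapper; the enum and the malformed-record check are hypothetical:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Counts malformed input lines via a Counter; the enum name is made up.
public class ValidatingMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    public enum RecordQuality { MALFORMED, WELL_FORMED }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().split(",").length < 2) {
            // Counter values are aggregated by the framework and shown in the job stats.
            context.getCounter(RecordQuality.MALFORMED).increment(1);
            return;
        }
        context.getCounter(RecordQuality.WELL_FORMED).increment(1);
        context.write(value, NullWritable.get());
    }
}
```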
Reposted from: https://www.cnblogs.com/gonens/p/10712490.html