I need to process a 9 GB CSV file. During the MapReduce job it has to do some grouping and produce a special format for a legacy system.
The input file looks like this:
AppId;Username;Other Fields like timestamps...
app/10;Mr Foobar;...
app/10;d0x;...
app/10;Mr leet;...
app/110;kr1s;...
app/110;d0x;...
...
The output file is quite simple:
app/10;3;Mr Foobar;d0x;Mr leet
app/110;2;kr1s;d0x
^      ^ ^^^^^^^^^^^^^^^^^^^^^
|      | \ A list of all users playing the game
|      \ Amount of users
\ AppId
To solve this, I wrote a mapper that emits the AppId as the key and the Username as the value. With that, the map phase runs fine.
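The mapper described above might look like the following sketch (class and field names are assumptions, not from the original post); it splits each semicolon-separated line and emits the first field as the key and the second as the value:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AppToUserMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Text appId = new Text();
    private final Text userName = new Text();

    @Override
    protected void map(final LongWritable key, final Text value, final Context context)
            throws IOException, InterruptedException {
        // Input format: AppId;Username;Other fields...
        final String[] fields = value.toString().split(";");
        if (fields.length < 2) {
            return; // skip malformed lines
        }
        appId.set(fields[0]);
        userName.set(fields[1]);
        context.write(appId, userName);
    }
}
```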
The problem occurs in the reduce phase: I get an Iterable<Text> userIds that may contain a huge list of userIds (> 5,000,000). The reducer that processes it looks like this:
public class UserToAppReducer extends Reducer<Text, Text, Text, UserSetWritable> {

    final UserSetWritable userSet = new UserSetWritable();

    @Override
    protected void reduce(final Text appId, final Iterable<Text> userIds, final Context context)
            throws IOException, InterruptedException {
        this.userSet.clear();
        for (final Text userId : userIds) {
            this.userSet.add(userId.toString());
        }
        context.write(appId, this.userSet);
    }
}
UserSetWritable is a custom Writable that stores the list of users. It is needed to produce the output (key = AppId, value = list of usernames). This is what UserSetWritable currently looks like:
public class UserSetWritable implements Writable {

    private final Set<String> userIds = new HashSet<String>();

    public void add(final String userId) {
        this.userIds.add(userId);
    }

    @Override
    public void write(final DataOutput out) throws IOException {
        out.writeInt(this.userIds.size());
        for (final String userId : this.userIds) {
            out.writeUTF(userId);
        }
    }

    @Override
    public void readFields(final DataInput in) throws IOException {
        this.userIds.clear(); // reset state: Hadoop reuses Writable instances between records
        final int size = in.readInt();
        for (int i = 0; i < size; i++) {
            this.userIds.add(in.readUTF());
        }
    }

    @Override
    public String toString() {
        final StringBuilder result = new StringBuilder();
        for (final String userId : this.userIds) {
            result.append(userId).append('\t');
        }
        result.append(this.userIds.size());
        return result.toString();
    }

    public void clear() {
        this.userIds.clear();
    }
}
With this approach, I get a Java heap space OutOfMemoryError:
Error: Java heap space
attempt_201303072200_0016_r_000002_0: WARN : mapreduce.Counters - Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
attempt_201303072200_0016_r_000002_0: WARN : org.apache.hadoop.conf.Configuration - session.id is deprecated. Instead, use dfs.metrics.session-id
attempt_201303072200_0016_r_000002_0: WARN : org.apache.hadoop.conf.Configuration - slave.host.name is deprecated. Instead, use dfs.datanode.hostname
attempt_201303072200_0016_r_000002_0: FATAL: org.apache.hadoop.mapred.Child - Error running child : java.lang.OutOfMemoryError: Java heap space
attempt_201303072200_0016_r_000002_0: at java.util.Arrays.copyOfRange(Arrays.java:3209)
attempt_201303072200_0016_r_000002_0: at java.lang.String.<init>(String.java:215)
attempt_201303072200_0016_r_000002_0: at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
attempt_201303072200_0016_r_000002_0: at java.nio.CharBuffer.toString(CharBuffer.java:1157)
attempt_201303072200_0016_r_000002_0: at org.apache.hadoop.io.Text.decode(Text.java:394)
attempt_201303072200_0016_r_000002_0: at org.apache.hadoop.io.Text.decode(Text.java:371)
attempt_201303072200_0016_r_000002_0: at org.apache.hadoop.io.Text.toString(Text.java:273)
attempt_201303072200_0016_r_000002_0: at com.myCompany.UserToAppReducer.reduce(UserToAppReducer.java:21)
attempt_201303072200_0016_r_000002_0: at com.myCompany.UserToAppReducer.reduce(UserToAppReducer.java:1)
attempt_201303072200_0016_r_000002_0: at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:164)
attempt_201303072200_0016_r_000002_0: at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:610)
attempt_201303072200_0016_r_000002_0: at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:444)
attempt_201303072200_0016_r_000002_0: at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
attempt_201303072200_0016_r_000002_0: at java.security.AccessController.doPrivileged(Native Method)
attempt_201303072200_0016_r_000002_0: at javax.security.auth.Subject.doAs(Subject.java:396)
attempt_201303072200_0016_r_000002_0: at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
attempt_201303072200_0016_r_000002_0: at org.apache.hadoop.mapred.Child.main(Child.java:262)
UserToAppReducer.java:21 is this line: this.userSet.add(userId.toString());
On the same cluster, I am able to process the data with this Pig script:
set job.name convertForLegacy
set default_parallel 4
data = load '/data/...txt'
using PigStorage(',')
as (appid:chararray,uid:chararray,...);
grp = group data by appid;
counter = foreach grp generate group, data.uid, COUNT(data);
store counter into '/output/....' using PigStorage(',');
So how can I solve this OutOfMemoryError with MapReduce?
Answer (score: 1)
A similar question about writing out 'large' values: Handling large output values from reduce step in Hadoop
Besides using that concept to write out large records (getting you the CSV list of 100,000 users), you'll need to use a composite key (the app ID and the user ID) and a custom partitioner to ensure all keys for a single app ID make it to the same reducer.
Something like this gist (untested).
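The custom partitioner part of that idea could be sketched as follows (class name and the "appId\tuserId" key layout are assumptions for illustration, not the gist's actual code): it hashes only the app-ID portion of the composite key, so every record for one app reaches the same reducer even though the keys themselves differ.

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class AppIdPartitioner extends Partitioner<Text, Text> {

    @Override
    public int getPartition(final Text key, final Text value, final int numPartitions) {
        // Assumed composite key layout: "appId\tuserId"
        final String composite = key.toString();
        final int tab = composite.indexOf('\t');
        final String appId = tab >= 0 ? composite.substring(0, tab) : composite;
        // Partition on the appId part only, masking the sign bit to keep the index non-negative
        return (appId.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

With this in place, the reducer sees all users of one app in sequence and can write each one out incrementally instead of buffering millions of strings in memory.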