I want to build a MapReduce job with the following input:
key1 \t 3,2|17412|553|15186,19199|15186,3947|15186,5938|15186,15517
key2 \t 925|10295|65182555,7344|7344,925|10295,7344|3,2
key3 \t 8747|18466|13289|3,2|13289,5106|12222,5106|5106,6374
........
Output: min \t (2,3), which is the intersection among every element of value1, every element of value2, ... and every element of valueN.
So I designed my mappers so that
mapper1 holds the intersection among the values of key1, key2, key3,
mapper2 holds the intersection among the values of key4, key5, key6,
and so on.
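My Reducers then take the results from these mappers and compute the final intersection. So my mapper and reducer basically use the same code. In my code I compute the intersection sequentially: first the intersection between value1 and value2, then that result is intersected with value3, and so on.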
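To make the sequential step concrete, here is a minimal plain-Java sketch of it (no Hadoop, no Guava). The class and method names `SequentialIntersection`, `intersectOnce` and `intersectAll` are made up for this illustration, and the sample values are shortened from the input above. Unlike my real code, it uses a `TreeSet` so the joined output order is deterministic, and a `LinkedHashSet` instead of `List.contains` for de-duplication.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class SequentialIntersection {

    // Intersect every group of `pre` with every group of `current`.
    // A group is a comma-separated string such as "15186,19199".
    public static List<String> intersectOnce(List<String> pre, List<String> current) {
        Set<String> result = new LinkedHashSet<>(); // de-duplicates, keeps first-seen order
        for (String p : pre) {
            Set<String> pSet = new TreeSet<>(Arrays.asList(p.split(",")));
            for (String c : current) {
                Set<String> common = new TreeSet<>(Arrays.asList(c.split(",")));
                common.retainAll(pSet);
                if (!common.isEmpty()) {
                    result.add(String.join(",", common));
                }
            }
        }
        return new ArrayList<>(result);
    }

    // Fold intersectOnce over all values, left to right.
    // An empty result corresponds to the "no_exist" case.
    public static List<String> intersectAll(List<String> values) {
        List<String> pre = null;
        for (String value : values) {
            List<String> current = Arrays.asList(value.trim().split("\\|"));
            pre = (pre == null) ? current : intersectOnce(pre, current);
            if (pre.isEmpty()) {
                break; // nothing can survive further intersections
            }
        }
        return pre == null ? new ArrayList<>() : pre;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList(
            "3,2|17412|553|15186,19199",
            "925|10295|7344,925|3,2",
            "8747|18466|3,2|13289,5106");
        System.out.println(String.join("|", intersectAll(values))); // prints 2,3
    }
}
```

Only the running result (`pre`) and the current value are needed at any time, which is why I expected this approach to be cheap on memory.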
My Mapper.
Mapper-code1:
public static class MapAPP extends Mapper<Text, Text, Text, Text>{
    // Joiner is Guava's com.google.common.base.Joiner.
    // State kept across map() calls; everything is static in this version.
    public static int j=0,k=0;
    public static List<String> min_pre = new ArrayList<>();
    public static List<String> min_current = new ArrayList<>();
    public static Set<String> min_p1 = new HashSet<>();
    public static Set<String> min_c1 = new HashSet<>();
    public static List<String> min_result = new ArrayList<>();
    public static Boolean no_exist_min=false;

    public void map(Text key, Text value, Context con) throws IOException, InterruptedException
    {
        String[] v=value.toString().split("\t");
        // aggregate min
        if (no_exist_min==false){
            if (j==0){
                // first record: its groups become the running intersection
                min_pre= Arrays.asList(v[1].toString().trim().split("\\|"));
                j=1;
            }else{
                // later records: intersect every previous group with every current group
                min_current= Arrays.asList(v[1].toString().trim().split("\\|"));
                for (String p: min_pre){
                    min_p1 = new HashSet<String>(Arrays.asList(p.split(",")));
                    for (String c: min_current){
                        min_c1 = new HashSet<String>(Arrays.asList(c.split(",")));
                        min_c1.retainAll(min_p1);
                        if (!min_c1.isEmpty()){
                            Joiner m_comma = Joiner.on(",").skipNulls();
                            String buff = m_comma.join(min_c1);
                            if (!min_result.contains(buff))
                                min_result.add(buff);
                        }
                    }
                }
                if (min_result.isEmpty()){
                    no_exist_min=true;
                } else {
                    min_pre=new ArrayList<>(min_result);
                    min_result.clear();
                }
            }
        }
    }

    // Emit the single aggregated result once all records have been mapped.
    protected void cleanup(Context con) throws IOException, InterruptedException {
        Joiner m_pipe = Joiner.on("|").skipNulls();
        if (no_exist_min==true){
            con.write(new Text("min"), new Text("no_exist"));
        }else {
            String min_str = m_pipe.join(min_pre);
            con.write(new Text("min"), new Text(min_str));
        }
    }
}
My Reducer (almost identical to the Mapper):
public static class ReduceAPP extends Reducer<Text, Text, Text, Text>
{
    public void reduce(Text key, Iterable<Text> values, Context con) throws IOException, InterruptedException
    {
        List<String> pre = new ArrayList<>();
        List<String> current = new ArrayList<>();
        Set<String> p1 = new HashSet<>();
        Set<String> c1 = new HashSet<>();
        List<String> result = new ArrayList<>();
        Joiner comma = Joiner.on(",").skipNulls();
        Joiner pipe = Joiner.on("|").skipNulls();
        Boolean no_exist=false;
        int i=0;
        // aggregate
        for(Text value: values){
            // compare string contents with equals(); == only compares references
            if ("no_exist".equals(value.toString().trim())){
                no_exist=true;
                break;
            }
            if (i==0){
                pre= Arrays.asList(value.toString().trim().split("\\|"));
                i=1;
            }else{
                current= Arrays.asList(value.toString().trim().split("\\|"));
                for (String p: pre){
                    p1 = new HashSet<String>(Arrays.asList(p.split(",")));
                    for (String c: current){
                        c1 = new HashSet<String>(Arrays.asList(c.split(",")));
                        c1.retainAll(p1);
                        if (!c1.isEmpty()){
                            String buff = comma.join(c1);
                            if (!result.contains(buff))
                                result.add(buff);
                        }
                    }
                }
                if (result.isEmpty()){
                    no_exist=true;
                    break;
                }
                pre=new ArrayList<>(result);
                result.clear();
            }
        }
        if (no_exist==true){
            con.write(key, new Text("no_exist"));
        }
        else{
            String preStr = pipe.join(pre);
            con.write(key, new Text(preStr));
        }
    }

    public static <T> Set<T> union(Set<T> setA, Set<T> setB) {
        Set<T> tmp = new TreeSet<T>(setA);
        tmp.addAll(setB);
        return tmp;
    }
}
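This runs perfectly on small input files, but it always runs out of memory on a large one (a ~450 MB text file). So I suspect my Java code is not memory-efficient. In my Reducer I use only local variables, which are destroyed when the reduce function finishes, so I am not worried about the Reducers. In my Mapper, however, I have to use static variables. In Mapper-code1 I made every variable static, while in Mapper-code2 I tried to use as few static variables as possible.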
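I have two questions:
1) In Mapper-code1, is each static variable shared among mappers, or is it private to one mapper? For example, if I have 5 mappers, will a single min_pre list be created and shared among the 5 mappers, or will the 5 mappers have 5 min_pre lists? What I want is the latter. How should I design my mapper so that with 5 mappers there are 5 min_pre lists?
2) Which consumes less memory, Mapper-code1 or Mapper-code2?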
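To show what I mean in question 1, here is a small plain-Java sketch of the distinction. `FakeMapper` is a made-up stand-in class and has nothing to do with Hadoop's `Mapper`; it only illustrates ordinary Java semantics, where a static field exists once per class in a JVM and an instance field exists once per object. What I do not know is whether several of my mappers ever live in the same JVM, which is the case where the static version would be shared.

```java
import java.util.ArrayList;
import java.util.List;

public class StaticVsInstanceDemo {

    // Hypothetical stand-in for a mapper class; this is NOT Hadoop's Mapper.
    public static class FakeMapper {
        // One list for the whole class: every FakeMapper in this JVM sees it.
        public static List<String> sharedList = new ArrayList<>();
        // One list per FakeMapper object.
        public List<String> ownList = new ArrayList<>();

        public void map(String value) {
            sharedList.add(value);
            ownList.add(value);
        }
    }

    public static void main(String[] args) {
        FakeMapper m1 = new FakeMapper();
        FakeMapper m2 = new FakeMapper();
        m1.map("a");
        m2.map("b");
        System.out.println(FakeMapper.sharedList); // [a, b]
        System.out.println(m1.ownList);            // [a]
        System.out.println(m2.ownList);            // [b]
    }
}
```

If my mappers behave like `m1` and `m2` here, then min_pre would have to be an instance field to get one list per mapper.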
Mapper-code2:
public static class MapAPP extends Mapper<Text, Text, Text, Text>{
    // Only the state that must survive across map() calls is static here.
    public static int j=0,k=0;
    public static List<String> min_pre = new ArrayList<>();
    public static List<String> min_result = new ArrayList<>();
    public static Boolean no_exist_min=false;

    public void map(Text key, Text value, Context con) throws IOException, InterruptedException
    {
        String[] v=value.toString().split("\t");
        // aggregate min
        if (no_exist_min==false){
            if (j==0){
                // first record: its groups become the running intersection
                min_pre= Arrays.asList(v[1].toString().trim().split("\\|"));
                j=1;
            }else{
                // temporaries are now locals instead of static fields
                List<String> min_current= Arrays.asList(v[1].toString().trim().split("\\|"));
                for (String p: min_pre){
                    Set<String> min_p1 = new HashSet<String>(Arrays.asList(p.split(",")));
                    for (String c: min_current){
                        Set<String> min_c1 = new HashSet<String>(Arrays.asList(c.split(",")));
                        min_c1.retainAll(min_p1);
                        if (!min_c1.isEmpty()){
                            Joiner m_comma = Joiner.on(",").skipNulls();
                            String buff = m_comma.join(min_c1);
                            if (!min_result.contains(buff))
                                min_result.add(buff);
                        }
                    }
                }
                if (min_result.isEmpty()){
                    no_exist_min=true;
                } else {
                    min_pre=new ArrayList<>(min_result);
                    min_result.clear();
                }
            }
        }
    }

    // Emit the single aggregated result once all records have been mapped.
    protected void cleanup(Context con) throws IOException, InterruptedException {
        Joiner m_pipe = Joiner.on("|").skipNulls();
        if (no_exist_min==true){
            con.write(new Text("min"), new Text("no_exist"));
        }else {
            String min_str = m_pipe.join(min_pre);
            con.write(new Text("min"), new Text(min_str));
        }
    }
}