I am new to Spark programming, and I am trying to count how many times a string occurs in a file. Here is my input:
-------------
2017-04-13 15:56:57.147::ProductSelectPanel::1291::PRODUCT_SALE_ENTRY::INAPHYD00124::1::CROC0008::CROCIN 120MG 60ML SYP::5::0::
2017-04-13 15:57:01.008::ProductSelectPanel::1599::PRODUCT_SALE_WITH_BARCODE::INAPHYD00124::1::CROC0008::CROCIN 120MG 60ML SYP::4::1::1013065197
2017-04-13 15:57:09.182::ProductSelectPanel::1118::ALTERNATIVE_PRODUCT_ENTRY::INAPHYD00124::1::CROC0005::CROCIN 500MG TAB::0
2017-04-13 15:57:15.153::ProductSelectPanel::1121::NO_STOCK_PRODUCT::INAPHYD00124::1::CROC0005::CROCIN 500MG TAB::0::0::
2017-04-13 15:57:19.696::ProductSelectPanel::1118::ALTERNATIVE_PRODUCT_ENTRY::INAPHYD00124::1::CROC0005::CROCIN 500MG TAB::0
2017-04-13 15:57:23.190::ProductSelectPanel::1291::PRODUCT_SALE_ENTRY::INAPHYD00124::1::CALP0005::CALPOL 500MG TAB::110::0::
2017-04-13 15:56:57.147::ProductSelectPanel::1291::PRODUCT_SALE_ENTRY::INAPHYD00124::1::CROC0008::CROCIN 120MG 60ML SYP::5::0::
2017-04-13 15:57:01.008::ProductSelectPanel::1599::PRODUCT_SALE_WITH_BARCODE::INAPHYD00124::1::CROC0008::CROCIN 120MG 60ML SYP::4::1::1013065197
2017-04-13 15:57:09.182::ProductSelectPanel::1118::ALTERNATIVE_PRODUCT_ENTRY::INAPHYD00124::1::CROC0005::CROCIN 500MG TAB::0
2017-04-13 15:57:15.153::ProductSelectPanel::1121::NO_STOCK_PRODUCT::INAPHYD00124::1::CROC0005::CROCIN 500MG TAB::0::0::
2017-04-13 15:57:19.696::ProductSelectPanel::1118::ALTERNATIVE_PRODUCT_ENTRY::INAPHYD00124::1::CROC0005::CROCIN 500MG TAB::0
2017-04-13 15:57:23.190::ProductSelectPanel::1291::PRODUCT_SALE_ENTRY::INAPHYD00124::1::CALP0005::CALPOL 500MG TAB::110::0::
2017-04-13 15:56:57.147::ProductSelectPanel::1291::PRODUCT_SALE_ENTRY::INAPHYD00124::1::CROC0008::CROCIN 120MG 60ML SYP::5::0::
2017-04-13 15:57:01.008::ProductSelectPanel::1599::PRODUCT_SALE_WITH_BARCODE::INAPHYD00124::1::CROC0008::CROCIN 120MG 60ML SYP::4::1::1013065197
2017-04-13 15:57:09.182::ProductSelectPanel::1118::ALTERNATIVE_PRODUCT_ENTRY::INAPHYD00124::1::CROC0005::CROCIN 500MG TAB::0
.......
My Spark program looks like this:
final Function<String, List<String>> LINE_MAPPER = new Function<String, List<String>>() {
    @Override
    public List<String> call(String line) throws Exception {
        String[] lineArary = line.split("::");
        return Arrays.asList(lineArary[3], lineArary[6]);
    }
};
final PairFunction<String, String, Integer> word_paper = new PairFunction<String, String, Integer>() {
    @Override
    public Tuple2<String, Integer> call(String word) throws Exception {
        return new Tuple2<String, Integer>(word, Integer.valueOf(1));
    }
};
JavaRDD<List<String>> javaRDD = lineRDD.map(LINE_MAPPER);
After the map transformation I get something like this:
[[PRODUCT_SALE_ENTRY,CROC0008],[NO_STOCK_PRODUCT,CROC0005],[NO_STOCK_PRODUCT,CROC0005],[PRODUCT_SALE_WITH_BARCODE,CROC0008],[PRODUCT_SALE_WITH_BARCODE,CROC0005],[PRODUCT_SALE_WITH_BARCODE,CROC003],....]
But I want the result to look like this:
[[NO_STOCK_PRODUCT,[CROC0005,4]],[PRODUCT_SALE_WITH_BARCODE,[CROC0008,2]],[PRODUCT_SALE_WITH_BARCODE,[CROC0005,1]],....]
Please help me. Thanks in advance.
Answer 0 (score: 0)
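It looks like you need to treat each key + string pair as a composite key, and count the occurrences of that composite key.
You can do this with countByValue() (see the JavaDoc). However, as the documentation says:
"Note that this method should only be used if the resulting map is expected to be small, as the whole thing is loaded into the driver's memory. To handle very large results, consider using rdd.map(x => (x, 1L)).reduceByKey(_ + _) ..."
So just map each value (for example [PRODUCT_SALE_ENTRY,CROC0008]) to a pair of the form ((PRODUCT_SALE_ENTRY,CROC0008), 1L), and then call reduceByKey().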
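To make the composite-key idea concrete, here is a minimal plain-Java sketch that runs without Spark. It is only a local analogue of "map each value to (key, 1), then reduce by key", not the Spark API itself: the class name CompositeKeyCount, the helpers keyOf and count, and the use of a java.util Map.Entry as the composite key are all assumptions made for illustration. In Spark the key would be a Tuple2 and the summing would be done by reduceByKey().

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration (no Spark): count how often each
// (eventType, productCode) composite key occurs.
class CompositeKeyCount {

    // Hypothetical helper: builds the composite key from fields 3 and 6 of a log line.
    static Map.Entry<String, String> keyOf(String line) {
        String[] fields = line.split("::");
        return new SimpleImmutableEntry<>(fields[3], fields[6]);
    }

    // Local analogue of "pair every key with 1, then sum the 1s per key".
    static Map<Map.Entry<String, String>, Integer> count(List<String> lines) {
        Map<Map.Entry<String, String>, Integer> counts = new HashMap<>();
        for (String line : lines) {
            counts.merge(keyOf(line), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Three lines taken from the sample input in the question.
        List<String> lines = Arrays.asList(
            "2017-04-13 15:57:09.182::ProductSelectPanel::1118::ALTERNATIVE_PRODUCT_ENTRY::INAPHYD00124::1::CROC0005::CROCIN 500MG TAB::0",
            "2017-04-13 15:57:15.153::ProductSelectPanel::1121::NO_STOCK_PRODUCT::INAPHYD00124::1::CROC0005::CROCIN 500MG TAB::0::0::",
            "2017-04-13 15:57:19.696::ProductSelectPanel::1118::ALTERNATIVE_PRODUCT_ENTRY::INAPHYD00124::1::CROC0005::CROCIN 500MG TAB::0");
        count(lines).forEach((key, n) ->
            System.out.println("((" + key.getKey() + "," + key.getValue() + "), " + n + ")"));
    }
}
```

The important design point is that the key type has value-based equals() and hashCode(), so two pairs with the same event type and product code land in the same bucket. Spark's Tuple2 behaves the same way, which is why it can serve as the key for reduceByKey().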
I have only done this in Scala, not Java. I think you may need to use mapToPair() in Java. This gives an RDD of the form:
((NO_STOCK_PRODUCT,CROC0005), 4),
((PRODUCT_SALE_WITH_BARCODE,CROC0008), 2),
((PRODUCT_SALE_WITH_BARCODE,CROC0005), 1),
...
which is close to what you asked for.
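If the nested shape from the question, [event, [product, count]], is really needed, the composite-key counts can be regrouped afterwards. The following is a small plain-Java sketch of that regrouping on a local map (no Spark). The class name NestedCounts and the method nest are hypothetical, and the counts are the three sample values shown above. In Spark the same reshaping could presumably be done with one more mapToPair() that moves the product code from the key into the value.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration (no Spark): turn ((event, product) -> count)
// into (event -> (product -> count)).
class NestedCounts {

    static Map<String, Map<String, Integer>> nest(Map<Map.Entry<String, String>, Integer> counts) {
        // TreeMap keeps events and products sorted, which makes the output stable.
        Map<String, Map<String, Integer>> nested = new TreeMap<>();
        for (Map.Entry<Map.Entry<String, String>, Integer> e : counts.entrySet()) {
            String event = e.getKey().getKey();
            String product = e.getKey().getValue();
            nested.computeIfAbsent(event, k -> new TreeMap<>()).put(product, e.getValue());
        }
        return nested;
    }

    public static void main(String[] args) {
        Map<Map.Entry<String, String>, Integer> counts = new LinkedHashMap<>();
        counts.put(new SimpleImmutableEntry<>("NO_STOCK_PRODUCT", "CROC0005"), 4);
        counts.put(new SimpleImmutableEntry<>("PRODUCT_SALE_WITH_BARCODE", "CROC0008"), 2);
        counts.put(new SimpleImmutableEntry<>("PRODUCT_SALE_WITH_BARCODE", "CROC0005"), 1);
        // prints {NO_STOCK_PRODUCT={CROC0005=4}, PRODUCT_SALE_WITH_BARCODE={CROC0005=1, CROC0008=2}}
        System.out.println(nest(counts));
    }
}
```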
Answer 1 (score: 0)
Thank you, DNA, it works great.
My final code looks like this:
JavaPairRDD<String, String> keyValuePairs = lineRDD.mapToPair(obj -> {
    String[] split = obj.split("::");
    return new Tuple2<String, String>(split[3], split[6]);
});
JavaPairRDD<Tuple2<String, String>, Integer> newRFDD = keyValuePairs.mapToPair(obj -> {
    return new Tuple2<Tuple2<String, String>, Integer>(new Tuple2<>(obj._1, obj._2), 1);
});
JavaPairRDD<Tuple2<String, String>, Integer> result = newRFDD.reduceByKey((v1, v2) -> {
    return v1 + v2;
});
result.map(f -> { return f._1._2() + "\t" + f._2() + "\t" + f._1._1(); }).saveAsTextFile("file:///home/charan/offlinefiles/result");
System.out.println("result :" + result.take(10));
And the output is:
CROC0005 620 NO_STOCK_PRODUCT
CROC2107 15 PRODUCT_SALE_ENTRY
CROC2120 7 NO_STOCK_PRODUCT
CROC0229 2 NO_STOCK_PRODUCT
CROC0009 1 NO_STOCK_PRODUCT
CROC0005 1250 ALTERNATIVE_PRODUCT_ENTRY
CROC2302 2 ALTERNATIVE_PRODUCT_ENTRY
CROC2807 5 PRODUCT_SALE_ENTRY
CROC0213 2 ALTERNATIVE_PRODUCT_ENTRY
CROC20221 18 ALTERNATIVE_PRODUCT_ENTRY
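One caveat about the code above: it indexes split[3] and split[6] directly, so any line with fewer than seven "::"-separated fields (a blank line, or a separator line such as "-------------") would throw an ArrayIndexOutOfBoundsException. Below is a minimal plain-Java sketch of a defensive parser. The class name LogLineParser and the method parse are hypothetical. In a Spark job, one option would be to apply a check like this in a filter() step before mapToPair().

```java
import java.util.Optional;

// Plain-Java illustration: extract (eventType, productCode) from a log line,
// skipping lines that do not have enough "::"-separated fields.
class LogLineParser {

    // Returns {eventType, productCode}, or empty when the line has fewer than 7 fields.
    static Optional<String[]> parse(String line) {
        String[] fields = line.split("::");
        if (fields.length < 7) {
            return Optional.empty();
        }
        return Optional.of(new String[] { fields[3], fields[6] });
    }

    public static void main(String[] args) {
        String good = "2017-04-13 15:57:15.153::ProductSelectPanel::1121::NO_STOCK_PRODUCT::INAPHYD00124::1::CROC0005::CROCIN 500MG TAB::0::0::";
        String bad = "-------------";
        // The well-formed line yields a pair; the separator line is skipped.
        parse(good).ifPresent(p -> System.out.println(p[0] + "\t" + p[1]));
        System.out.println("separator line parsed: " + parse(bad).isPresent());
    }
}
```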