Question

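I am trying to build a Map<String, Map<String, Long>> from an input dataset (CSV). It should hold one entry per column of the dataset, whose value is a Map from each element present in that column to its number of occurrences. So given sample input like this: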

col1,col2,col3
a,1,c6
ab,23,c6
cd,23,c8
a,1,x

My output should be:

{col1:{a:2, ab:1, cd:1}},
{col2:{1:2, 23:2}},
{col3:{c6:2, c8:1, x:1}}
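To pin down the target structure, here is a minimal plain-Java sketch (no Spark) that computes the same nested map for a small in-memory input. The class name ColumnFrequencies and the method count are hypothetical, and the sketch assumes the first line is the header and that no field contains a quoted comma. It is only meant as a reference for small files, not as a replacement for a distributed job.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the original post.
class ColumnFrequencies {
    // Returns column name -> (value -> number of occurrences).
    // Assumes lines.get(0) is the header and fields contain no quoted commas.
    static Map<String, Map<String, Long>> count(List<String> lines) {
        Map<String, Map<String, Long>> result = new LinkedHashMap<>();
        if (lines.isEmpty()) {
            return result;
        }
        String[] columns = lines.get(0).split(",", -1);
        for (String column : columns) {
            result.put(column, new LinkedHashMap<>());
        }
        for (String line : lines.subList(1, lines.size())) {
            String[] fields = line.split(",", -1);
            // Short rows are tolerated: only the fields that exist are counted.
            for (int i = 0; i < columns.length && i < fields.length; i++) {
                result.get(columns[i]).merge(fields[i], 1L, Long::sum);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "col1,col2,col3", "a,1,c6", "ab,23,c6", "cd,23,c8", "a,1,x");
        // prints {col1={a=2, ab=1, cd=1}, col2={1=2, 23=2}, col3={c6=2, c8=1, x=1}}
        System.out.println(count(lines));
    }
}
```

Map.merge adds 1 to the running count for a value, or starts it at 1 when the value is seen for the first time, so the whole input is handled in one pass.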

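I already have a way to take each column separately, count its elements into a Map with "countByValue", and then store each Map as the value in the per-column Map. Now I am thinking about a way to speed up the computation by reading the file once, and I tried using the "flatMapToPair" function on my file: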

JavaRDD<String> fileRdd

like this:

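// Emit one (column name, value) pair per field of each CSV line.
// `columns` is the list of column names, defined elsewhere.
JavaPairRDD<String, String> res = fileRdd.flatMapToPair(
    new PairFlatMapFunction<String, String, String>() {
        @Override
        public Iterator<Tuple2<String, String>> call(String x) {
            List<Tuple2<String, String>> pairs = new ArrayList<>();
            // Split on commas that are not inside double quotes; -1 keeps trailing empty fields.
            List<String> d = Arrays.asList(x.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1));
            for (int i = 0; i < columns.size(); i++) {
                pairs.add(new Tuple2<String, String>(columns.get(i), d.get(i)));
            }
            return pairs.iterator();
        }
});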

followed by groupByKey:

JavaPairRDD<String,Iterable<String>> groupMap = res.groupByKey();

Now I have a result like this:

col1:[a,ab,cd,a]

I think I need another map-reduce step to count the occurrences, so this may not be the best way to reach the goal...
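The extra counting step mentioned here can be sketched in plain Java as a function that turns the values grouped under one column into a value-to-count map. The class name GroupedValueCounter and the method countValues are hypothetical. In Spark, a function like this could plausibly be applied to each group with mapValues on the grouped RDD, but that is an assumption about how it would be wired in; groupByKey would still move every single value across the cluster first, which is why keying by (column, value) and summing with reduceByKey is a commonly suggested alternative.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

// Hypothetical helper for illustration; not part of the original post.
class GroupedValueCounter {
    // Turns the values grouped under one column into value -> occurrences.
    // Values must be non-null, because groupingBy rejects null keys.
    static Map<String, Long> countValues(Iterable<String> values) {
        return StreamSupport.stream(values.spliterator(), false)
                .collect(Collectors.groupingBy(
                        Function.identity(),   // group equal values together
                        LinkedHashMap::new,    // keep first-seen order
                        Collectors.counting()));
    }

    public static void main(String[] args) {
        // The grouped values shown for col1 in the question.
        // prints {a=2, ab=1, cd=1}
        System.out.println(countValues(Arrays.asList("a", "ab", "cd", "a")));
    }
}
```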

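I also noticed that the first flatMapToPair computation on a 200 MB file runs out of memory, after running for longer than my previous computation needed to process the same file, so I am probably doing something wrong with flatMapToPair.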

Answer 1

If you use a DataFrame instead of an RDD, there is a simple solution.

//import com.fasterxml.jackson.core.JsonGenerator;
//import com.fasterxml.jackson.core.JsonParseException;
//import com.fasterxml.jackson.core.JsonProcessingException;
//import com.fasterxml.jackson.core.type.TypeReference;
//import com.fasterxml.jackson.databind.JsonMappingException;
//import com.fasterxml.jackson.databind.ObjectMapper;

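// Read CSV; the header option makes Spark take the column names
// (col1, col2, col3) from the first line instead of counting it as data
Dataset<Row> df = spark.read().option("header", "true").csv(fileName);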
// Initialize ObjectMapper
ObjectMapper mapper = new ObjectMapper();
mapper.configure(JsonGenerator.Feature.QUOTE_FIELD_NAMES, false);

// Map for collecting column information
Map<String, Map<String,Long>> columnCountMap = new HashMap<String, Map<String,Long>>();
for (String columnName : df.columns())
    {
        // Group and count using groupBy function 
        // and then convert to JSON and collect as List
        List<String> jsons = df.groupBy(columnName).count().toJSON().collectAsList();
        try
            {
                Map<String,Long> countMap = new HashMap<String, Long>();
                // Iterate through the strings/rows; 
                // map it to Map then collect values;
                // put them into the countMap
                for (String json : jsons)
                    {
                        // Each row is a JSON object holding the column value and its count
                        Map<String, String> map = mapper.readValue(json, new TypeReference<Map<String, String>>(){});
                        // Look both fields up by name rather than relying on value order
                        countMap.put(map.get(columnName), Long.parseLong(map.get("count")));

                    }
                columnCountMap.put(columnName, countMap);
            }
        catch (JsonParseException e)
            {
                e.printStackTrace();
            }
        catch (JsonMappingException e)
            {
                e.printStackTrace();
            }
        catch (IOException e)
            {
                e.printStackTrace();
            }
    }
String output = "";
try
{
        // If you need to output as {col1:{a:2, ab:1, cd:1}},
        // {col2:{1:2, 23:2}},
        // {col3:{c6:2, c8:1, x:1}}
        output = mapper.writeValueAsString(columnCountMap);
}
catch (JsonProcessingException e)
{
       e.printStackTrace();
}
