Question

I'm new to pyspark. I have two separate .txt files, each of which is really a CSV file with a tab as the delimiter.

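One of these files, "file A", contains a user ID and a field I want to count. The other, "file B", contains each user's age and sex, and of course the user ID as well (to be precise, the files come from https://www.kaggle.com/c/kddcup2012-track2/data but that isn't important).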

What I want to do is count the field from file A, grouped by sex and age.

So the output would be:

count(value), sex, age
count(value), sex, age 
...

A SQL query for this would basically be:

select count(val), sex, age from a, b where a.userid = b.userid group by sex, age;

This is what I'm doing in pyspark:

path = "/user/root/test.txt"
lines = sc.textFile(path)

# read the file, then parse each line:
# the first field is the val to be counted, the second one is the userid
data = lines.map(lambda l: l.split()).map(lambda l: (float(l[0]), int(l[1]))).collect()

I know I could now do a group-by with something like data.groupby..(field..).

But I really don't know how to continue: the examples I found by googling all assume a single file when grouping.

How do I perform a groupBy across multiple files in pyspark?
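
For reference, the SQL query above could probably be expressed on RDDs roughly as follows. This is only a sketch under assumptions: file A is laid out as val<TAB>userid (as in the snippet above), file B is assumed to be userid<TAB>sex<TAB>age, and the names parse_a, parse_b and count_by_sex_age are made up for illustration. It keys both RDDs by userid, joins them, and then counts the joined rows per (sex, age) pair. Note that it does not call collect() before the join, so the data stays distributed.

```python
def parse_a(line):
    # Assumed layout of file A: val<TAB>userid
    fields = line.split("\t")
    return int(fields[1]), float(fields[0])        # (userid, val)


def parse_b(line):
    # Assumed layout of file B: userid<TAB>sex<TAB>age
    fields = line.split("\t")
    return int(fields[0]), (fields[1], fields[2])  # (userid, (sex, age))


def count_by_sex_age(sc, path_a, path_b):
    # sc is an existing SparkContext (the pyspark shell provides one as `sc`)
    a = sc.textFile(path_a).map(parse_a)
    b = sc.textFile(path_b).map(parse_b)
    # Inner join on the key: elements look like (userid, (val, (sex, age)))
    joined = a.join(b)
    # Emit ((sex, age), 1) for every joined row and sum the ones per key
    return (joined
            .map(lambda kv: (kv[1][1], 1))
            .reduceByKey(lambda x, y: x + y))
```

In the pyspark shell this might be used as count_by_sex_age(sc, "/user/root/test.txt", path_to_file_b).collect(), where path_to_file_b is a placeholder for wherever file B lives. Each element of the result is ((sex, age), count).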
