Question

I need some Spark filters and transformations for the following sample data:

657483, 888888, 9
657483, 888889, 10
657484, 888888, 20
657484, 888889, 30

For each x[0], I want to check the condition x[1] == '888888' && x[1] == '888889' and get the corresponding x[2], so that the output looks like:

657483,9,10
657484,20,30

I want to do this with Spark map and filter transformations, so I tried:

result = (file1
    .map(lambda x: (x.split(",")[0],x))
    .groupByKey()
    .map(lambda x: (x[0], list(x[1])))  
    .sortByKey('true')
    .coalesce(1).map(lambda line: (line[0], if(line[1] == "888888"), and (line[1] == "888889"))).saveAsTextFile('hdfs://localhost:9000/filter'))

It gives me results like:

657483,false,false
657484,false,false

How can I extract the {{1}} and x[2] that go with x[0]? And how can an if condition be applied to filter the results?

Answer 1

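def filterfunct(x):
    # Keep only keys whose group holds exactly the two wanted ids, in either order.
    if len(x[1]) != 2:
        return False
    else:
        if (x[1][0][0] == 888888 and x[1][1][0] == 888889) or (x[1][1][0] == 888888 and x[1][0][0] == 888889):
            return True
        else:
            return False

def mapfunct(x):
    # Emit (key, value for 888888, value for 888889) whatever the list order is.
    if x[1][0][0] == 888888:
        return (x[0], x[1][0][1], x[1][1][1])
    else:
        return (x[0], x[1][1][1], x[1][0][1])

result = (file1
    .map(lambda x: (x.split(",")[0], (int(x.split(",")[1]), int(x.split(",")[2]))))
    .groupByKey()
    # Use a list rather than filter(): in Python 3, filter() returns an iterator, so len() would fail.
    .map(lambda x: (x[0], [y for y in x[1] if y[0] == 888888 or y[0] == 888889]))
    .filter(filterfunct)
    # Sort while the records are still (key, value) pairs, which is what sortByKey expects.
    .sortByKey(True)
    .map(mapfunct)
    .saveAsTextFile('hdfs://localhost:9000/filter'))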

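groupByKey() yields results like (657483, [(888888, 9), (888889, 10)]), where (x, y) is a tuple and [x, y] is a list. However, you do not know the order in which the list is built: most of the time it follows the order in which the lines were read, but if two consecutive lines end up in different partitions, they may come out reversed. That is why filterfunct and mapfunct check both orders.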
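Since the order inside each group is not guaranteed, one way to avoid depending on it is to turn the group into a dict keyed by x[1] and look both ids up by name. The following is only a minimal sketch in plain Python (no Spark needed to run it); extract_pair and the in-memory grouping are hypothetical helpers made up for illustration, not part of the code above.

```python
def extract_pair(key, pairs, wanted=(888888, 888889)):
    """Return (key, value for wanted[0], value for wanted[1]), or None.

    pairs is any iterable of (id, value) tuples, in any order.
    """
    lookup = dict(pairs)  # if an id repeats, the last value wins
    if all(w in lookup for w in wanted):
        return (key,) + tuple(lookup[w] for w in wanted)
    return None


lines = [
    "657483, 888888, 9",
    "657483, 888889, 10",
    "657484, 888888, 20",
    "657484, 888889, 30",
]

# Stand-in for groupByKey(): collect (id, value) pairs per key.
groups = {}
for line in lines:
    key, ident, value = [field.strip() for field in line.split(",")]
    groups.setdefault(key, []).append((int(ident), int(value)))

# Reverse each group on purpose to show the result does not depend on order.
rows = [extract_pair(k, reversed(v)) for k, v in sorted(groups.items())]
rows = [r for r in rows if r is not None]
print(rows)  # [('657483', 9, 10), ('657484', 20, 30)]
```

In the Spark pipeline, something like .map(lambda x: extract_pair(x[0], x[1])).filter(lambda r: r is not None) after groupByKey() would presumably replace the filterfunct/mapfunct pair, assuming each key has at most one row per id.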
