Question

I have an RDD that is read from a file in the following format:

    199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
    unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985
    ...

This code gets me the fields I need:

    rdd = rdd.map(lambda x: (x.split(" ")[0], x.split(" ")[3][1:12], x.split(" ")[5], x.split(" ")[7], x.split(" ")[8]))
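As a side note, the same extraction can be written so that each line is split only once. The sketch below is plain Python and assumes the log lines are space-separated exactly as the indices above imply; `parse_line` and `sample` are illustrative names, not part of the original code.

```python
def parse_line(line):
    """Split a log line once and pick host, date, method, protocol, and status."""
    parts = line.split(" ")
    # Same indices as the lambda above: 0 = host, 3 = "[date:time", 5 = method,
    # 7 = protocol, 8 = status code.
    return (parts[0], parts[3][1:12], parts[5], parts[7], parts[8])


# Hypothetical sample line, assumed to match the layout of the input file.
sample = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'
fields = parse_line(sample)
print(fields)
```

With an RDD of raw lines, this function could be passed directly as `rdd.map(parse_line)`.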

After this step, this is what I have:

    rdd.take(4)
    [('199.72.81.55', '01/Jul/1995', '"GET', 'HTTP/1.0"', '200'), ('unicomp6.unicomp.net', '01/Jul/1995', '"GET', 'HTTP/1.0"', '200'), ('199.120.110.21', '01/Jul/1995', '"GET', 'HTTP/1.0"', '200'), ('199.120.110.21', '01/Jul/1995', '"GET', 'HTTP/1.0"', '200')]

I need the unique hosts in this file, so I did this:

    rdd2 = rdd.map(lambda x : x[0])

I get:

    rdd2.take(4)
    ['199.72.81.55', 'unicomp6.unicomp.net', '199.120.110.21', '199.120.110.21']

So far, so good, but now I have a problem I don't understand:

    h = set()
    rdd2.map(lambda x: h.add(x))  # line with the error, I suppose
    print(h)
    set()  # the set is empty; I have no idea why the values aren't being added to it

What I expect is a set of unique values like this:

    {'199.72.81.55', 'unicomp6.unicomp.net', '199.120.110.21'}

Can anyone point out why adding values to the set from inside the lambda doesn't work?
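A likely explanation, offered as a sketch rather than a definitive answer: `map` on an RDD is a lazy transformation, so with no action called on its result the lambda never runs. Even with an action, `h` would be copied into the closure sent to the workers, so the driver's copy would stay empty. In PySpark the usual route is `set(rdd2.distinct().collect())`. The snippet below imitates only the laziness part, using Python 3's built-in `map` with no Spark involved; `hosts` is a hypothetical stand-in for the contents of `rdd2`.

```python
# Hypothetical stand-in for the values held in rdd2 (no Spark needed here).
hosts = ['199.72.81.55', 'unicomp6.unicomp.net', '199.120.110.21', '199.120.110.21']

h = set()
lazy = map(lambda x: h.add(x), hosts)  # builds a lazy iterator; the lambda has not run
empty_before = set(h)                  # snapshot taken before anything is consumed

list(lazy)                             # consuming the iterator finally runs the lambda
filled_after = set(h)                  # snapshot taken after the side effects happened

unique_hosts = set(hosts)              # simpler: build the set from the values directly
print(empty_before, filled_after == unique_hosts)
```

The analogy stops at laziness: in Spark, forcing evaluation would still not fill the driver-side set, which is why collecting the distinct values back to the driver is the safer pattern.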

How do I process an RDD in PySpark without iterating?

0 answers: