I have an RDD that is read from a file in the following format:
199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985
...
This code gets me the fields I need:
rdd = rdd.map(lambda x : (x.split(" ")[0], x.split(" ")[3][1:12], x.split(" ")[5], x.split(" ")[7], x.split(" ")[8]))
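The same extraction can be sketched with a single split per line, which may be easier to read. `parse_line` is a hypothetical helper name (not part of Spark), and the sketch assumes every line has at least nine space-separated fields in the layout shown above.

```python
def parse_line(line):
    """Extract (host, date, method, protocol, status) from one log line."""
    parts = line.split(" ")   # split once instead of five times
    host = parts[0]
    date = parts[3][1:12]     # skip the leading "[" and keep e.g. 01/Jul/1995
    method = parts[5]         # still carries the opening quote, e.g. '"GET'
    protocol = parts[7]       # still carries the closing quote, e.g. 'HTTP/1.0"'
    status = parts[8]
    return (host, date, method, protocol, status)


# Sample line in the log format described above (assumed layout).
sample = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'
fields = parse_line(sample)
```

With this helper the mapping step would become `rdd = rdd.map(parse_line)`.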
After this step, I get the following result:
rdd.take(4)
[('199.72.81.55', '01/Jul/1995', '"GET', 'HTTP/1.0"', '200'), ('unicomp6.unicomp.net', '01/Jul/1995', '"GET', 'HTTP/1.0"', '200'), ('199.120.110.21', '01/Jul/1995', '"GET', 'HTTP/1.0"', '200'),('199.120.110.21', '01/Jul/1995', '"GET', 'HTTP/1.0"', '200')]
I need the unique hosts in this file, so I did this:
rdd2 = rdd.map(lambda x : x[0])
And I get:
rdd2.take(4)
['199.72.81.55', 'unicomp6.unicomp.net', '199.120.110.21', '199.120.110.21']
So far, so good. But now I have a problem that I don't understand:
h = set()
rdd2.map(lambda x: h.add(x)) #line with error, i suppose
print(h)
{} # the set is empty, i have no idea why my set isn't adding the values to the set
What I expect is a set of unique values like this:
{'199.72.81.55', 'unicomp6.unicomp.net', '199.120.110.21'}
Can anyone point out why adding values to the set inside the lambda doesn't work?
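One possible direction, offered as a sketch and not a definitive answer: `map` is a lazy transformation, so nothing runs until an action is called, and even then the lambda would run on the executors against a serialized copy of `h`, not the set on the driver. Assuming `rdd2` holds the host strings shown above, the unique hosts could be gathered with `distinct()` followed by `collect()`. The helper name `unique_hosts` is hypothetical.

```python
def unique_hosts(host_rdd):
    """Return the distinct elements of an RDD of host names as a Python set.

    Assumes host_rdd is a PySpark RDD of strings (for example rdd2 above).
    """
    # distinct() is a transformation and is still lazy;
    # collect() is the action that runs the job and returns a list to the driver.
    return set(host_rdd.distinct().collect())
```

Usage would be `h = unique_hosts(rdd2)`. Since `collect()` brings every distinct host back to the driver, this only fits cases where the number of unique hosts is small enough to hold in memory.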