How can I find identical elements in a PySpark column?

Time: 2017-05-06 13:54:37

Tags: pyspark

I have a txt file containing a dataset with five columns. The first column holds phone numbers, and I have to find the phone numbers that appear more than once. My txt file looks like this:

...
0544147,23,86,40.761650,29.940929
0544147,23,104,40.768749,29.968599
0538333,21,184,40.764679,29.929543
05477900,21,204,40.773071,29.975010
0561554,23,47,40.764694,29.927397
0556645,24,6,40.821587,29.920273
...

My code is:

from pyspark import SparkContext

sc = SparkContext()
rdd_data = sc.textFile("dataset.txt")

data1 = []

lines = rdd_data.collect()
lines = [x.strip() for x in lines]

for line in lines:
    data1.append([float(x.strip()) for x in line.split(',')])

column0 = [row[0] for row in data1]  # the first column, collected as a list

So I don't know how to find the identical phone numbers in the first column. I'm very new to pyspark and python. Thanks in advance.

1 Answer:

Answer 0 (score: 1)

from pyspark import SparkContext

sc = SparkContext()
rdd_data = sc.textFile("dataset.txt")

rdd_telephone_numbers = rdd_data.map(lambda line: line.split(",")).map(lambda line: int(line[0]))
print (rdd_telephone_numbers.collect()) # [544147, 544147, 538333, 5477900, 561554, 556645]

If you want a step-by-step explanation of the data transformation:

from pyspark import SparkContext

sc = SparkContext()
rdd_data = sc.textFile("dataset.txt")

rdd_data_1 = rdd_data.map(lambda line: line.split(","))
# this will transform every row of your dataset
# you had these data in your dataset:
# 0544147,23,86,40.761650,29.940929
# 0544147,23,104,40.768749,29.968599
# ...........
# now you have a single RDD like this: 
# [[u'0544147', u'23', u'86', u'40.761650', u'29.940929'], [u'0544147', u'23', u'104', u'40.768749', u'29.968599'],....]

rdd_telephone_numbers = rdd_data_1.map(lambda line: int(line[0]))
# this will take only the first element of every line of the rdd, so now you have:
# [544147, 544147, 538333, 5477900, 561554, 556645]