How do I flatten an RDD in Python?

Time: 2018-06-21 08:37:15

Tags: python pyspark

I have a dataset of spam messages with the following data type:

pyspark.rdd.PipelinedRDD

When I run spams.take(3), I get:

[["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"], ['WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.'], ['Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030']]

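As you can see, each element of the list is wrapped in its own pair of brackets. How can I get rid of those brackets? I have tried several ways to flatten it, but nothing seems to work.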

4 answers:

Answer 0: (score: 2)

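You can use the RDD's flatMap method. It lets you produce multiple rows from a single row, so every element of each inner list becomes its own row.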

spams.flatMap(lambda x:x).take(3)
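To see what flatMap(lambda x: x) does to each row, here is a minimal local sketch in plain Python that runs without Spark. The helper local_flat_map is a hypothetical stand-in that mimics the behaviour on an ordinary list; it is not part of PySpark, and the sample rows are shortened versions of the messages in the question.

```python
def local_flat_map(func, rows):
    """Hypothetical stand-in for RDD.flatMap on a plain list:
    apply func to every row and concatenate the resulting iterables."""
    result = []
    for row in rows:
        # func returns an iterable; each of its items becomes its own output row
        result.extend(func(row))
    return result

# Shortened sample rows shaped like the output of spams.take(3)
rows = [
    ["Free entry in 2 a wkly comp"],
    ["WINNER!! As a valued network customer"],
    ["Had your mobile 11 months or more?"],
]

flat = local_flat_map(lambda x: x, rows)
print(flat)
```

Because the identity function returns the inner list itself, each one-element list contributes exactly one string to the result, which is why the inner brackets disappear.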

Answer 1: (score: 1)

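Your question doesn't make clear whether you want to remove the brackets before or after the data is collected into a list. Other users have already answered the "after" case, so I'll answer for the case where the data is still an RDD. It's simple: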

spams = spams.map(lambda x: x[0])
print(spams.take(3))

This removes the inner "brackets".
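One caveat worth noting: x[0] keeps only the first element of each row, so it is equivalent to flattening only when every row holds exactly one string, as in the sample shown in the question. The plain-Python sketch below illustrates the difference on made-up rows; the two-element row and the empty row are assumptions for illustration and do not come from the dataset.

```python
# Made-up rows: one normal row, one with two elements, one empty
rows = [["msg a"], ["msg b", "msg c"], []]

# Flattening keeps every element and simply skips the empty row
flattened = [item for row in rows for item in row]

# Taking row[0] drops "msg c", and would raise IndexError on the empty row,
# so empty rows are filtered out first
first_only = [row[0] for row in rows if row]

print(flattened)
print(first_only)
```

If empty rows could occur in the RDD, a similar guard might be written as spams.filter(lambda x: x).map(lambda x: x[0]), though whether that is needed depends on how the data was loaded.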

Answer 2: (score: 0)

These lines of code should help.

>>> msg = [["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"],
...  ['WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.'],
...  ['Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030']]
>>> msg = [x[0] for x in msg]
>>> msg
["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's", 'WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.', 'Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030']

Answer 3: (score: 0)

Try a for loop, where "data" is the list returned by spams.take(3).

mylist = []
for entry in data:
  print(entry)
  for e in entry:
    mylist.append(e)
print(mylist)
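For a list that has already been collected, the same nested loop can probably be written more compactly with the standard library. This is a minimal sketch, assuming data is a list of lists like the one returned by spams.take(3); the sample strings are shortened placeholders.

```python
from itertools import chain

# Shortened stand-in for the list returned by spams.take(3)
data = [
    ["Free entry in 2 a wkly comp"],
    ["WINNER!! As a valued network customer"],
    ["Had your mobile 11 months or more?"],
]

# chain.from_iterable walks each inner list in turn, like the nested loop
mylist = list(chain.from_iterable(data))
print(mylist)
```

Note that this only works on data already brought to the driver; it does not operate on the RDD itself.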