I have a dataset of spam messages with the following data type:
pyspark.rdd.PipelinedRDD
When I run spams.take(3), I get:
[["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"],
['WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.'],
['Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030']]
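As you can see, each element of the list is wrapped in its own pair of brackets. How do I get rid of those brackets? I have tried several ways to flatten it, but nothing seems to work.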
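Answer 0 (score: 2)
You can use the RDD's flatMap method. It lets you produce multiple rows from a single input row, so returning each inner list unchanged flattens the result.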
spams.flatMap(lambda x:x).take(3)
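To see what flatMap is doing here without starting Spark, the following is a rough plain-Python sketch. The helper name flat_map and the shortened messages are made up for illustration; this mimics the semantics on a local list and is not PySpark's actual implementation.

```python
from itertools import chain

def flat_map(func, rows):
    """Local stand-in for RDD.flatMap: apply func to each row, then concatenate the results."""
    return list(chain.from_iterable(func(row) for row in rows))

# Shortened stand-ins for the nested rows returned by spams.take(3)
rows = [["Free entry ..."], ["WINNER!! ..."], ["Had your mobile ..."]]

# The identity function returns each inner list, and flat_map splices them together
flat = flat_map(lambda x: x, rows)
print(flat)
```

Each inner list is iterated and its elements are emitted as separate top-level items, which is why the inner brackets disappear.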
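Answer 1 (score: 1)
Your question doesn't make clear whether you want to remove the brackets before or after collecting the list, and another user has already covered the "after" case, so I'll answer for the case where the data is still an RDD. It's simple: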
spams = spams.map(lambda x:x[0])
print(spams.take(3))
This removes the inner brackets.
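One caveat worth spelling out: taking x[0] keeps only the first element of each row, so it matches flattening only when every inner list holds exactly one item. A small plain-Python sketch of the difference, using hypothetical rows that are not from the original dataset:

```python
# Hypothetical rows: one normal, one with two messages, one empty
rows = [["msg a"], ["msg b", "msg c"], []]

# Equivalent of map(lambda x: x[0]): keeps only the first element of each row.
# row[0] raises IndexError on an empty row, so the `if row` guard skips those.
first_only = [row[0] for row in rows if row]

# Equivalent of flatMap(lambda x: x): keeps every element of every row
flattened = [item for row in rows for item in row]

print(first_only)
print(flattened)
```

For the data in the question every row has a single string, so both approaches should give the same result.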
Answer 2 (score: 0)
These lines of code should help.
>>> msg = [["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"],
... ['WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.'],
... ['Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030']]
>>> msg = [x[0] for x in msg]
>>> msg
["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's", 'WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.', 'Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030']
Answer 3 (score: 0)
Try a for loop, where "data" is the list returned by spams.take(3).
mylist = []
for entry in data:
    print(entry)
    for e in entry:
        mylist.append(e)
print(mylist)
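If you prefer something more compact, the same nested loop can be written as a single list comprehension. This is only a sketch; the contents of data below are placeholder strings standing in for the real messages.

```python
# Placeholder for the list returned by spams.take(3)
data = [["first message"], ["second message"], ["third message"]]

# Outer loop walks the inner lists, inner loop walks their elements
mylist = [e for entry in data for e in entry]
print(mylist)
```

Note that this works on the collected Python list, so it only makes sense for data small enough to bring back to the driver.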