Selecting columns from an RDD read from a CSV file results in an oddly structured RDD

Time: 2019-05-08 06:14:08

Tags: python pyspark rdd

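I am trying to select some columns from an RDD that holds data read from a CSV file. However, these operations leave the RDD in a form I cannot use, and I do not understand why.

Code: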

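# Read the CSV as plain text lines and drop the header row
raw = sc.textFile('/FileStore/tables/Tweets.csv')
header = raw.first()
raw = raw.filter(lambda line: line != header)
# Split on commas, drop the first 10 and last 4 fields, discard empty rows
raw = raw.map(lambda x: x.split(',')[10:]).map(lambda x: x[:-4]).filter(lambda x: x)
raw.take(10)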

Result:

[['@VirginAmerica What @dhepburn said.'],
 ["@VirginAmerica plus you've added commercials to the experience... tacky."],
 ["@VirginAmerica I didn't today... Must mean I need to take another trip!"],
 ['"@VirginAmerica it\'s really aggressive to blast obnoxious ""entertainment"" in your guests\' faces & they have little recourse"'],
 ["@VirginAmerica and it's a really big bad thing about it"],
 ['"@VirginAmerica yes',
  ' nearly every time I fly VX this “ear worm” won’t go away :)"'],
 ['"@VirginAmerica Really missed a prime opportunity for Men Without Hats parody',
  ' there."'],
 ['"@virginamerica Well', ' I didn\'t…but NOW I DO! :-D"'],
 ['"@VirginAmerica it was amazing',
  ' and arrived an hour early. You\'re too good to me."'],
 ['@VirginAmerica did you know that suicide is the second leading cause of death among teens 10-24']]

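The rows in the resulting RDD do not all have the same structure: some contain one element and others contain two. What am I doing wrong?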
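One likely explanation, given the output above: the rows that come back as two elements are the ones whose tweet text is wrapped in double quotes and contains a comma. A plain `split(',')` does not know about CSV quoting, so it cuts inside the quoted field. The sketch below shows the difference using Python's standard `csv` module. The helper name `parse_csv_line` and the sample line are made up for illustration, and the sample is not taken from the real file.

```python
import csv

def parse_csv_line(line):
    """Parse a single CSV line, keeping commas inside double-quoted fields."""
    # csv.reader accepts any iterable of lines; wrap the one line in a list.
    return next(csv.reader([line]))

# Hypothetical line: the third field is quoted and contains a comma.
sample = 'a,b,"@VirginAmerica yes, nearly every time",x'

naive = sample.split(',')        # cuts the quoted field in two
parsed = parse_csv_line(sample)  # keeps the quoted field whole

print(len(naive))   # 5
print(len(parsed))  # 4
print(parsed[2])    # @VirginAmerica yes, nearly every time
```

With the RDD from the question, the same helper could be used as `raw.map(parse_csv_line).map(lambda cols: cols[10])`, assuming the tweet text really is the 11th column. This still parses line by line, so a tweet that contains a line break inside its quoted text would be split across records by `sc.textFile`. If that applies to this file, reading it with `spark.read.csv(path, header=True, multiLine=True, escape='"')` and selecting columns from the resulting DataFrame may be the more robust route.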

0 answers:

No answers yet.