Question

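I am trying to select some columns from an RDD whose data was read from a CSV file. However, after these operations the RDD ends up in a form I cannot use, and I don't understand why.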

Code:

raw = sc.textFile('/FileStore/tables/Tweets.csv')
header = raw.first()
raw = raw.filter(lambda line: line!=header)
raw = raw.map(lambda x: x.split(',')[10:]).map(lambda x: x[:-4]).filter(lambda x: x)
raw.take(10)

Result:

[['@VirginAmerica What @dhepburn said.'],
 ["@VirginAmerica plus you've added commercials to the experience... tacky."],
 ["@VirginAmerica I didn't today... Must mean I need to take another trip!"],
 ['"@VirginAmerica it\'s really aggressive to blast obnoxious ""entertainment"" in your guests\' faces &amp; they have little recourse"'],
 ["@VirginAmerica and it's a really big bad thing about it"],
 ['"@VirginAmerica yes',
  ' nearly every time I fly VX this “ear worm” won’t go away :)"'],
 ['"@VirginAmerica Really missed a prime opportunity for Men Without Hats parody',
  ' there."'],
 ['"@virginamerica Well', ' I didn\'t…but NOW I DO! :-D"'],
 ['"@VirginAmerica it was amazing',
  ' and arrived an hour early. You\'re too good to me."'],
 ['@VirginAmerica did you know that suicide is the second leading cause of death among teens 10-24']]

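The rows of the resulting RDD do not share the same structure: some are lists with a single string, while others are split into two pieces. What am I doing wrong?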
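One likely cause, judging from the output above: every row that comes back in two pieces starts with a double quote, which suggests the tweet text is a quoted CSV field containing a comma, and `split(',')` cuts it at that comma. Here is a minimal sketch of the difference, using Python's standard `csv` module. The helper name `parse_csv_line` and the sample row are made up for illustration and are not taken from Tweets.csv.

```python
import csv

def parse_csv_line(line):
    """Split one CSV line into fields, keeping double-quoted fields whole."""
    # csv.reader accepts any iterable of strings; a one-element list yields one row.
    return next(csv.reader([line]))

# Hypothetical sample row: the second field is quoted and contains a comma.
sample = 'id1,"@VirginAmerica yes, nearly every time",2015-02-24'

naive_fields = sample.split(',')        # the quoted text is cut at its inner comma
parsed_fields = parse_csv_line(sample)  # the quoted text stays in one field

print(len(naive_fields), len(parsed_fields))
```

If that is what is happening, the same helper could replace the `split` call, for example `raw.map(parse_csv_line)`, after which the tweet text would presumably sit at a single index (index 10, going by the slice in the question, assuming no other field contains a comma). This still would not handle tweets containing line breaks, because `sc.textFile` splits the file into lines before any parsing happens. Reading the file with `spark.read.csv(path, header=True)` and its `multiLine` option may be the more robust route.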

Selecting columns from an RDD read from a CSV file results in a strange RDD format

0 answers: