I am trying to select some columns from an RDD whose data was read from a CSV file. However, for some reason these operations leave the RDD unusable.
Code:
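# Read the file as an RDD of text lines
raw = sc.textFile('/FileStore/tables/Tweets.csv')
# Drop the header row
header = raw.first()
raw = raw.filter(lambda line: line!=header)
# Split each line on commas, keep fields from index 10 on, drop the last 4, discard empty rows
raw = raw.map(lambda x: x.split(',')[10:]).map(lambda x: x[:-4]).filter(lambda x: x)
raw.take(10)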
Result:
[['@VirginAmerica What @dhepburn said.'],
["@VirginAmerica plus you've added commercials to the experience... tacky."],
["@VirginAmerica I didn't today... Must mean I need to take another trip!"],
['"@VirginAmerica it\'s really aggressive to blast obnoxious ""entertainment"" in your guests\' faces & they have little recourse"'],
["@VirginAmerica and it's a really big bad thing about it"],
['"@VirginAmerica yes',
' nearly every time I fly VX this “ear worm” won’t go away :)"'],
['"@VirginAmerica Really missed a prime opportunity for Men Without Hats parody',
' there."'],
['"@virginamerica Well', ' I didn\'t…but NOW I DO! :-D"'],
['"@VirginAmerica it was amazing',
' and arrived an hour early. You\'re too good to me."'],
['@VirginAmerica did you know that suicide is the second leading cause of death among teens 10-24']]
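The rows of the resulting RDD do not all have the same structure: some contain one element and others contain two. What am I doing wrong?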
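
One likely cause, judging from the output above, is that the rows with two elements are tweets that contain a comma inside a quoted field, so a plain split(',') cuts the tweet in half. The following is a minimal sketch in plain Python (no Spark needed) that compares a naive split with the standard csv module. The sample line and its column layout are made up for illustration and are not taken from Tweets.csv.

```python
import csv

# Hypothetical sample row (not from Tweets.csv): the quoted field contains a comma.
line = '1,neutral,"@VirginAmerica yes, nearly every time I fly VX",2015-02-24'

# A naive split treats the comma inside the quotes as a field separator.
naive = line.split(',')

# csv.reader honours the quotes and keeps the quoted field in one piece.
parsed = next(csv.reader([line]))

print(len(naive))   # number of pieces after the naive split
print(len(parsed))  # number of real CSV fields
print(parsed[2])    # the tweet text, with its comma intact
```

If this is indeed the cause, the same idea could probably be applied per line in the RDD, for example with raw.map(lambda l: next(csv.reader([l]))). That would still not handle tweets that span several lines, since textFile splits on newlines first; in that case reading the file with spark.read.csv and its multiLine option may be a better fit, though I have not verified that against this dataset.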