Using a Scala filter inside a Spark map operation

Time: 2017-02-22 17:30:13

Tags: scala apache-spark twitter

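I have a small dataset of tweets, and I want to remove the usernames from them. I need to drop every word that starts with @, but the last map() operation in the code below throws java.lang.StringIndexOutOfBoundsException: String index out of range: 0. Inside the map operation I split each sentence into words and then use the filter operation from the Scala collections rather than Spark's, and I wonder whether the problem is related to that. When I comment out .filter(_(0) != '@'), everything works.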

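import org.apache.spark.{SparkConf, SparkContext}

val logFile = "tweets10.csv"
val config = new SparkConf().setMaster("local").setAppName("Spark App")
val sc = new SparkContext(config)

val logData = sc.textFile(logFile, 2).cache()

// Drop the CSV header (first line of the first partition), take the text column,
// then try to remove every word that starts with '@'.
val tweets = logData.mapPartitionsWithIndex((index, line) => if (index == 0) line.drop(1) else line)
                              .map(_.split(",")(1).replace("\"", ""))
                              .map(line => line.split(" ")
                                  .filter(_(0) != '@')  // the exception goes away when this line is commented out
                                  .reduce((x,y) => x + " " + y))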

Dataset:

"","text","favorited","favoriteCount","replyToSN","created","truncated","replyToSID","id","replyToUID","statusSource","screenName","retweetCount","isRetweet","retweeted","longitude","latitude"
"1","RT @WDD: Check today how you can join World Diabetes Day: htts/EIQ1Za0R0t. Eyes on #diabetes htts/rN3VJYC7T0",FALSE,0,NA,2016-09-07 20:12:03,FALSE,NA,"773614831018643457",NA,"<a href=""htt://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","un_ncd",27,TRUE,FALSE,NA,NA
"2","RT @JDRFUK: With his #Rio2016 medal in hand Team GB gymnast @louissmith1989 puts type 1 #diabetes in the picture! htts:/OKkPtQLuvi",FALSE,0,NA,2016-09-07 20:10:44,FALSE,NA,"773614501853880320",NA,"<a href=""htt://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","sg0809",2,TRUE,FALSE,NA,NA
"3","RT @CleanairCA: Speaking of the things in the air you breath...
    #asthma #diabetes #copd #lungcancer #smog #losangeles #HeartDisease htts:/",FALSE,0,NA,2016-09-07 20:09:03,FALSE,NA,"773614075284746240",NA,"<a href=""htt://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","tt85207533",9,TRUE,FALSE,NA,NA
"4","So - tonight's #tweetchat is about FOOD - ""#Diabetes and Diets"" (aka - stuff we eat)  #gbdoc",FALSE,1,NA,2016-09-07 20:08:28,FALSE,NA,"773613929515941888",NA,"<a href=""htt://www.tchat.io"" rel=""nofollow"">tchat.io</a>","theGBDOC",0,FALSE,FALSE,NA,NA
"5","Learn the most important things you can do to prevent #diabetes here: htts:/eHu5pesgKw.",FALSE,0,NA,2016-09-07 20:07:00,FALSE,NA,"773613560320495617",NA,"<a href=""htt://sproutsocial.com"" rel=""nofollow"">Sprout Social</a>","MountainPointMC",0,FALSE,FALSE,NA,NA
"6","Cancer risk #NaturalCures #AlternativeMedicine #Cures #Healing #HerbalRemedies #Diabetes htts:/Ul0vwRpqbw htts:/YU77iuudeR",FALSE,0,NA,2016-09-07 20:06:09,FALSE,NA,"773613345480007680",NA,"<a href=""htt://www.socialcloudsuite.com"" rel=""nofollow"">SocialCloudSuite</a>","CureExchange",0,FALSE,FALSE,NA,NA
"7","Cancer risk #NaturalCures #AlternativeMedicine #Cures #Healing #HerbalRemedies #Diabetes htts:/wEjrW9f9b1 htts:/iHlSpbwzZl",FALSE,0,NA,2016-09-07 20:06:08,FALSE,NA,"773613341826805760",NA,"<a href=""htt://www.socialcloudsuite.com"" rel=""nofollow"">SocialCloudSuite</a>","GuineaHenWeed",0,FALSE,FALSE,NA,NA
"8","Linda Yip hopes to find better ways to diagnose, treat &amp; prevent #diabetes: htts:/tmjgnEFUkZ  #WIMmonth htts:/xL25me7ckK",FALSE,0,NA,2016-09-07 20:05:14,FALSE,NA,"773613114533171200",NA,"<a href=""htts://about.twitter.com/products/tweetdeck"" rel=""nofollow"">TweetDeck</a>","StanfordDeptMed",0,FALSE,FALSE,NA,NA
"9","A Farm Stand In South Dallas Is Fighting #Diabetes With Common Sense And Vegetables htts:/l9pWvnAA5W",FALSE,0,NA,2016-09-07 20:05:08,FALSE,NA,"773613090378166273",NA,"<a href=""htt://www.hootsuite.com"" rel=""nofollow"">Hootsuite</a>","DiabetesDallas",0,FALSE,FALSE,NA,NA
"10","Hi #gbdoc Paul here, #t1d #teampump and #cgm - 4.5 years with #diabetes now!",FALSE,0,NA,2016-09-07 20:04:25,FALSE,NA,"773612908693614592",NA,"<a href=""htt://itunes.apple.com/us/app/twitter/id409789998?mt=12"" rel=""nofollow"">Twitter for Mac</a>","t1hba1c",0,FALSE,FALSE,NA,NA

1 answer:

Answer 0 (score: 3)

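Without knowing what your dataset actually contains, I'm going to go out on a hunch here and say that after the split your data contains empty strings. Calling word(0) on an empty string is what raises the exception, so add an extra check for empty strings: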

_.split(" ")
 .filter(word => word != "" && word(0) != '@')
 .reduce((x,y) => x + " " + y)
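
The dataset in the question does seem to support this hunch: some tweets contain two consecutive spaces (for example "stuff we eat)  #gbdoc"), and splitting on a single space then yields an empty token. Below is a minimal sketch in plain Scala, without Spark, that shows the empty token and a slightly more defensive variant of the filter. The helper name stripMentions is hypothetical and not part of the original post. It uses mkString instead of reduce on the assumption that a tweet consisting only of @-mentions is possible, in which case reduce would fail on the empty collection.

```scala
// Hypothetical helper (not from the original post): removes @-mentions from one tweet.
def stripMentions(tweet: String): String =
  tweet.split(" ")
    // Skip empty tokens first, so no token is ever indexed at position 0 when empty.
    .filter(word => word.nonEmpty && !word.startsWith("@"))
    // Unlike reduce, mkString returns "" when no words are left.
    .mkString(" ")

// Two consecutive spaces produce an empty token between "eat)" and "#gbdoc".
val tokens = "stuff we eat)  #gbdoc".split(" ")

val cleaned = stripMentions("RT @WDD: Check  today")
```

In the Spark job from the question, the last map could then presumably be written as .map(stripMentions), which leaves the rest of the pipeline unchanged.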