Below is the sample raw data:
tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
Below is my program:
val data = sc.textFile("/user/inputs/Tweets.csv")
val map_data = data.map(x=> x.split(","))
val filterdata = map_data.filter(x=> x(5) == "Virgin America").count()
It throws an ArrayIndexOutOfBoundsException.
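Answer 0 (score: 2)
Your data cannot be split safely with a plain split(","), which is why you get the array index out of bounds error. See the code below; option 1 reproduces your version.
Option 2 is an improved version that uses the Spark CSV API, which may be useful to you.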
package examples

import org.apache.log4j.Level
import org.apache.spark.sql.{Dataset, SparkSession}

object CSVTest extends App {
  val spark = SparkSession.builder().appName("CsvExample").master("local").getOrCreate()
  // Reduce Spark's log noise
  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)
  import spark.implicits._
  import org.apache.spark.sql.functions._

  // linesIterator is used instead of lines, because on JDK 11+ `lines`
  // resolves to java.lang.String.lines; blank lines are dropped as well
  val csvData: Dataset[String] = spark.sparkContext.parallelize(
    """
      |tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
      |570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
    """.stripMargin.linesIterator.filter(_.trim.nonEmpty).toList).toDS()

  println("option 2 : spark csv version ")
  // Let the CSV reader parse the header and infer the column types
  val frame = spark.read.option("header", true).option("inferSchema", true).csv(csvData)
  frame.show()
  frame.printSchema()
  println(frame.filter($"airline" === "Virgin America").count())

  println("option 1: your version which is not splittable thats the reason getting arrayindex out of bound ")
  // Plain comma split, as in the question
  val filterdata = csvData.map(x => x.split(","))
  filterdata.foreach(x => println(x.mkString))
  // filterdata.show(false)
  // filterdata.filter { x => {
  //   println(x)
  //   x(5) == "Virgin America"
  //  }
  // }
  // .count()
}
Result:
option 2 : spark csv version
+------------------+-----------------+----------------------------+--------------+-------------------------+--------------+----------------------+-------+-------------------+-------------+--------------------+-----------+--------------------+--------------+--------------------+
|          tweet_id|airline_sentiment|airline_sentiment_confidence|negativereason|negativereason_confidence|       airline|airline_sentiment_gold|   name|negativereason_gold|retweet_count|                text|tweet_coord|       tweet_created|tweet_location|       user_timezone|
+------------------+-----------------+----------------------------+--------------+-------------------------+--------------+----------------------+-------+-------------------+-------------+--------------------+-----------+--------------------+--------------+--------------------+
|570306133677760513|          neutral|                         1.0|          null|                     null|Virgin America|                  null|cairdin|               null|            0|@VirginAmerica Wh...|       null|2015-02-24 11:35:...|          null|Eastern Time (US ...|
+------------------+-----------------+----------------------------+--------------+-------------------------+--------------+----------------------+-------+-------------------+-------------+--------------------+-----------+--------------------+--------------+--------------------+
root
|-- tweet_id: long (nullable = true)
|-- airline_sentiment: string (nullable = true)
|-- airline_sentiment_confidence: double (nullable = true)
|-- negativereason: string (nullable = true)
|-- negativereason_confidence: string (nullable = true)
|-- airline: string (nullable = true)
|-- airline_sentiment_gold: string (nullable = true)
|-- name: string (nullable = true)
|-- negativereason_gold: string (nullable = true)
|-- retweet_count: integer (nullable = true)
|-- text: string (nullable = true)
|-- tweet_coord: string (nullable = true)
|-- tweet_created: string (nullable = true)
|-- tweet_location: string (nullable = true)
|-- user_timezone: string (nullable = true)
1
option 1: your version which is not splittable thats the reason getting arrayindex out of bound
tweet_idairline_sentimentairline_sentiment_confidencenegativereasonnegativereason_confidenceairlineairline_sentiment_goldnamenegativereason_goldretweet_counttexttweet_coordtweet_createdtweet_locationuser_timezone
570306133677760513neutral1.0Virgin Americacairdin0@VirginAmerica What @dhepburn said.2015-02-24 11:35:52 -0800Eastern Time (US & Canada)
Process finished with exit code 0
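To make the failure mode concrete, here is a minimal plain-Scala sketch (no Spark required) of one way that indexing into the result of split(",") can go out of bounds: String.split drops trailing empty fields unless it is given a negative limit. The object SplitDemo, its helper methods, and the truncated row are made up for illustration. Whether this exact case occurs in the asker's file is an assumption; quoted commas or line breaks inside the text column would break a plain split in a similar way.

```scala
object SplitDemo {
  // Plain split: trailing empty fields are dropped
  def naive(line: String): Array[String] = line.split(",")

  // A negative limit keeps trailing empty fields
  def keepEmpty(line: String): Array[String] = line.split(",", -1)

  // Bounds-checked lookup: None instead of ArrayIndexOutOfBoundsException
  def column(fields: Array[String], i: Int): Option[String] =
    if (i >= 0 && i < fields.length) Some(fields(i)) else None

  // Hypothetical row whose remaining columns are all empty
  val shortRow = "570306133677760513,neutral,1.0,,,"

  def main(args: Array[String]): Unit = {
    println(naive(shortRow).length)      // 3
    println(keepEmpty(shortRow).length)  // 6
    println(column(naive(shortRow), 5))  // None
  }
}
```

With the plain split, x(5) on such a row throws, while the bounds-checked lookup simply yields None. The CSV reader in option 2 avoids the problem because it parses each row against the header.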
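Answer 1 (score: 0)
I found the solution: some rows in my dataset have empty fields, and those caused the ArrayIndexOutOfBoundsException. Sample data:
DBN,School Name,Number of Test Takers,Critical Reading Mean,Mathematics Mean,Writing Mean
01M292,Henry Street School for International Studies,31,391,425,385
01M448,University Neighborhood High School,60,394,419,387
01M450,East Side Community High School,69,418,431,402
01M458,STELLITE ACADEMY FORSYTH,26,385,370,378
01M509,CMSP HIGH SCHOOL,,,
I changed the function as follows, using array.length to filter out the rows that have empty data: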
'''
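import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

val conf = new SparkConf().setAppName("Spark Scala School Data analysis Example").setMaster("local[1]")
val sc = new SparkContext(conf)
val data = sc.textFile("C:\\Sankhadeep\\Study\\data\\SAT_School_Level_Results.csv", 2)
val spark = SparkSession.builder().appName("sample").master("local").getOrCreate()
// split(",") drops trailing empty fields, so rows with missing scores give shorter arrays
val map_data = data.map(x=> x.split(","))
//val map_data1 = map_data.map(x=> handleNull(x))
// Keep rows that have a 5th column, skip the header row, then keep Mathematics Mean above 500
val sample = map_data.filter(x=> (x.length > 4)).filter(x=> (x(4) != "Mathematics Mean")).filter(x=> (x(4).toInt > 500 ))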
'''
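The length check above can be combined with a guarded number parse, so that a non-numeric value in the score column is skipped instead of throwing a NumberFormatException. The following is only a sketch in plain Scala collections (the same map/filter chain should carry over to an RDD). The object SatFilterDemo, the method highMath, and the last sample row are hypothetical and not part of the original answer.

```scala
import scala.util.Try

object SatFilterDemo {
  // Sample rows for illustration; the last one is made up so that something passes the filter
  val sample: Seq[String] = Seq(
    "DBN,School Name,Number of Test Takers,Critical Reading Mean,Mathematics Mean,Writing Mean",
    "01M448,University Neighborhood High School,60,394,419,387",
    "01M509,CMSP HIGH SCHOOL,,,",
    "99X999,Hypothetical High School,10,510,520,505"
  )

  // Keep rows whose 5th column exists and parses to an Int above the threshold.
  // The header and empty or non-numeric values fail the parse and are dropped.
  def highMath(lines: Seq[String], threshold: Int): Seq[Array[String]] =
    lines
      .map(_.split(",", -1))
      .filter(f => f.length > 4 && Try(f(4).trim.toInt).toOption.exists(_ > threshold))

  def main(args: Array[String]): Unit =
    highMath(sample, 500).foreach(f => println(f.mkString(",")))
}
```

Because the header fails the Int parse, the separate comparison against "Mathematics Mean" is not needed in this variant.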