Spark filter function throws an error

Time: 2020-02-21 21:53:33

Tags: scala apache-spark

Below is the sample raw data:

    tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
    570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)

Below is my program:

    val data = sc.textFile("/user/inputs/Tweets.csv")
    val map_data = data.map(x => x.split(","))
    val filterdata = map_data.filter(x => x(5) == "Virgin America").count()

It fails with an ArrayIndexOutOfBoundsException.

2 Answers:

Answer 0 (score: 2)

Your data is not splittable as plain text; that is why you get an ArrayIndexOutOfBounds. See the code below: option 1 reproduces your version.

I optimized it with the Spark CSV API, which may work for you.


    package examples

    import org.apache.log4j.Level

    object CSVTest extends App {
      import org.apache.spark.sql.{Dataset, SparkSession}
      val spark = SparkSession.builder().appName("CsvExample").master("local").getOrCreate()
      val logger = org.apache.log4j.Logger.getLogger("org")
      logger.setLevel(Level.WARN)
      import spark.implicits._
      import org.apache.spark.sql.functions._
      val csvData: Dataset[String] = spark.sparkContext.parallelize(
        """
          |tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
          |570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
        """.stripMargin.lines.toList).toDS()


      println("option 2 : spark csv version ")
      val frame = spark.read.option("header", true).option("inferSchema",true).csv(csvData)
      frame.show()
      frame.printSchema()
      println(frame.filter($"airline" === "Virgin America").count())

      println("option 1: your version which is not splittable thats the reason getting arrayindex out of bound ")
      val filterdata = csvData.map(x => x.split(","))
      filterdata.foreach(x => println(x.mkString))
    //    filterdata.show(false)
    //    filterdata.filter {x=> {
    //      println(x)
    //      x(5) == "Virgin America"
    //    }
    //    }
    //  .count()


    }

Result:

option 2 : spark csv version 
+------------------+-----------------+----------------------------+--------------+-------------------------+--------------+----------------------+-------+-------------------+-------------+--------------------+-----------+--------------------+--------------+--------------------+
|          tweet_id|airline_sentiment|airline_sentiment_confidence|negativereason|negativereason_confidence|       airline|airline_sentiment_gold|   name|negativereason_gold|retweet_count|                text|tweet_coord|       tweet_created|tweet_location|       user_timezone|
+------------------+-----------------+----------------------------+--------------+-------------------------+--------------+----------------------+-------+-------------------+-------------+--------------------+-----------+--------------------+--------------+--------------------+
|570306133677760513|          neutral|                         1.0|          null|                     null|Virgin America|                  null|cairdin|               null|            0|@VirginAmerica Wh...|       null|2015-02-24 11:35:...|          null|Eastern Time (US ...|
+------------------+-----------------+----------------------------+--------------+-------------------------+--------------+----------------------+-------+-------------------+-------------+--------------------+-----------+--------------------+--------------+--------------------+

root
 |-- tweet_id: long (nullable = true)
 |-- airline_sentiment: string (nullable = true)
 |-- airline_sentiment_confidence: double (nullable = true)
 |-- negativereason: string (nullable = true)
 |-- negativereason_confidence: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- airline_sentiment_gold: string (nullable = true)
 |-- name: string (nullable = true)
 |-- negativereason_gold: string (nullable = true)
 |-- retweet_count: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- tweet_coord: string (nullable = true)
 |-- tweet_created: string (nullable = true)
 |-- tweet_location: string (nullable = true)
 |-- user_timezone: string (nullable = true)

1
option 1: your version which is not splittable thats the reason getting arrayindex out of bound 
tweet_idairline_sentimentairline_sentiment_confidencenegativereasonnegativereason_confidenceairlineairline_sentiment_goldnamenegativereason_goldretweet_counttexttweet_coordtweet_createdtweet_locationuser_timezone
570306133677760513neutral1.0Virgin Americacairdin0@VirginAmerica What @dhepburn said.2015-02-24 11:35:52 -0800Eastern Time (US & Canada)


Process finished with exit code 0
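Why a plain `split(",")` breaks on this data: it splits inside quoted fields and drops trailing empty columns, so real rows can yield shifted or shorter arrays than the 15-column header. A minimal sketch; the tweet row with a comma inside the quoted text field is hypothetical, not taken from the sample above:

```scala
// Hypothetical row: the text field is quoted and contains a comma.
val quoted = """570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,"Great flight, thanks!",,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)"""
val naive = quoted.split(",")
println(naive.length) // 16, not 15: the comma inside the quotes adds a column

// Trailing empty columns are dropped unless split is given a negative limit.
val short = "01M509,CMSP HIGH SCHOOL,,,"
println(short.split(",").length)     // 2: the three empty columns vanish
println(short.split(",", -1).length) // 5: limit -1 keeps them
```

This is why the Spark CSV reader in option 2, which understands quoting and empty fields, counts the row correctly, while the naive split in option 1 can index past the end of the array.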

Answer 1 (score: 0)

I found the solution: some empty data in my dataset was causing the ArrayIndexOutOfBoundsException. I changed the function as follows:

I filtered out the rows with empty data by checking the array length.

Sample data

    DBN,SCHOOL NAME,Number of Test Takers,Critical Reading Mean,Mathematics Mean,Writing Mean
    01M292,HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES,31,391,425,385
    01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,60,394,419,387
    01M450,EAST SIDE COMMUNITY SCHOOL,69,418,431,402
    01M458,SATELLITE ACADEMY FORSYTH,26,385,370,378
    01M509,CMSP HIGH SCHOOL,,,

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SparkSession

    val conf = new SparkConf().setAppName("Spark Scala School Data analysis Example").setMaster("local[1]")
    val sc = new SparkContext(conf)
    val data = sc.textFile("C:\\Sankhadeep\\Study\\data\\SAT_School_Level_Results.csv", 2)
    val spark = SparkSession.builder().appName("sample").master("local").getOrCreate()
    val map_data = data.map(x => x.split(","))
    //val map_data1 = map_data.map(x => handleNull(x))
    val sample = map_data.filter(x => x.length > 4).filter(x => x(4) != "Mathematics Mean").filter(x => x(4).toInt > 500)
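The length guard in the filter chain can be factored into a plain predicate and checked without Spark. A sketch, assuming the column layout of the sample data above; `keepRow` and the passing row `01M999` are hypothetical, and the nonEmpty/digit guards are extra safety not present in the original chain:

```scala
// Mirrors the answer's filter chain: enough columns, not the header row,
// and a Mathematics Mean (column index 4) above 500.
def keepRow(cols: Array[String]): Boolean =
  cols.length > 4 &&
    cols(4) != "Mathematics Mean" &&
    cols(4).nonEmpty && cols(4).forall(_.isDigit) &&
    cols(4).toInt > 500

val header  = "DBN,SCHOOL NAME,Number of Test Takers,Critical Reading Mean,Mathematics Mean,Writing Mean".split(",")
val below   = "01M292,HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES,31,391,425,385".split(",")
val sparse  = "01M509,CMSP HIGH SCHOOL,,,".split(",")
val passing = "01M999,EXAMPLE HIGH SCHOOL,50,500,510,505".split(",") // hypothetical row

println(keepRow(header))  // false: header row
println(keepRow(below))   // false: 425 is not above 500
println(keepRow(sparse))  // false: too few columns after split
println(keepRow(passing)) // true
```

Because the default `split(",")` drops the trailing empty columns of the sparse row, the `cols.length > 4` check is what keeps `cols(4)` from throwing.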