Question

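I had a similar problem before, but I am looking for a generalizable answer. I am using spark-corenlp to get sentiment scores on emails. Sometimes sentiment() crashes on some input (maybe it is too long, maybe it contains an unexpected character). It does not tell me that it crashed on those instances; it just returns the Column sentiment('email). Thus, when I try to show() beyond a certain point or save() my data frame, I get a java.util.NoSuchElementException, because sentiment() must have returned nothing at that row.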

My initial code loads the data and applies sentiment() as shown in the spark-corenlp API:

    import org.apache.spark.sql.types._
    import org.apache.spark.sql.functions.col
    import com.databricks.spark.corenlp.functions._
    import sqlContext.implicits._

    val customSchema = StructType(Array(
      StructField("contactId", StringType, true),
      StructField("email", StringType, true)
    ))

    // Load dataframe
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("delimiter", "\t")          // Delimiter is tab
      .option("parserLib", "UNIVOCITY")   // Parser, which deals better with the email formatting
      .schema(customSchema)               // Schema of the table
      .load("emails")                     // Input file

    // Add sentiment analysis output to dataframe
    val sent = df.select('contactId, sentiment('email).as('sentiment))

I tried filtering out null and NaN values:

    val sentFiltered = sent.filter('sentiment.isNotNull)
      .filter(!'sentiment.isNaN)
      .filter(col("sentiment").between(0, 4))

I even tried to do this via a SQL query.

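I don't know what input is making spark-corenlp crash. How can I find out? Alternatively, how can I filter these non-existent values out of col("sentiment")? Or should I try catching the exception and ignoring the row? Is that even possible?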
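
To make the "catch the exception and ignore the row" idea concrete, here is a minimal sketch in plain Scala. It assumes the scorer is an ordinary function that can be called per string; `rawScore`, `safeScore`, and `failingInputs` are hypothetical names, and `rawScore` is only a toy stand-in that fails on blank text, not the actual spark-corenlp implementation.

```scala
import scala.util.Try

object SafeSentiment {
  // Hypothetical stand-in for the real scorer (an assumption, not spark-corenlp code).
  // It looks only at the first "sentence", so blank input throws
  // java.util.NoSuchElementException, similar to the error described above.
  def rawScore(text: String): Int = {
    val sentences = text.split("[.!?]").map(_.trim).filter(_.nonEmpty).toList
    val first = sentences.head // throws NoSuchElementException on an empty list
    if (first.toLowerCase.contains("thanks")) 3 else 2 // toy score in the 0..4 range
  }

  // Wrap the call so that a non-fatal exception becomes None instead of failing the job.
  def safeScore(text: String): Option[Int] =
    Try(rawScore(text)).toOption

  // Collect the inputs that fail, to inspect what is breaking the scorer.
  def failingInputs(texts: Seq[String]): Seq[String] =
    texts.filter(t => safeScore(t).isEmpty)
}
```

In Spark, a function of type String => Option[Int] such as `safeScore` can be passed to `org.apache.spark.sql.functions.udf`; rows where it returns None then hold null in the resulting column, so a filter like 'sentiment.isNotNull can drop them. Whether the real sentiment() can be wrapped this way depends on how spark-corenlp exposes it, so treat this as a pattern rather than a drop-in fix.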

Spark Scala - java.util.NoSuchElementException & Data Cleaning

0 answers: