I have a dataframe with a "text" column, in which many rows contain English sentences.
text
It is evening
Good morning
Hello everyone
What is your name
I'll see you tomorrow
I have a variable of type List holding some words, for example:
val removeList = List("Hello", "evening", "because", "is")
I want to remove from the text column every word that appears in removeList.
So my output should be:
It
Good morning
everyone
What your name
I'll see you tomorrow
How can I do this with Spark Scala?
I wrote code like this:
val stopWordsList = List("Hello", "evening", "because", "is");
val df3 = sqlContext.sql("SELECT text FROM table");
val df4 = df3.map(x => cleanText(x.mkString, stopWordsList));
def cleanText(x: String, stopWordsList: List[String]): Any = {
  for (str <- stopWordsList) {
    if (x.contains(str)) {
      x.replaceAll(str, "")
    }
  }
}
But I get these errors:
Error:(44, 12) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
val df4 = df3.map(x => cleanText(x.mkString, stopWordsList));
Error:(44, 12) not enough arguments for method map: (implicit evidence$6: org.apache.spark.sql.Encoder[String])org.apache.spark.sql.Dataset[String].
Unspecified value parameter evidence$6.
val df4 = df3.map(x => cleanText(x.mkString, stopWordsList));
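The root cause: cleanText declares a return type of Any, its for loop returns Unit, and each replaceAll result is discarded, so Spark cannot find an Encoder for the mapped Dataset. A minimal sketch of a fix, assuming import spark.implicits._ is in scope (the \b word boundaries are my addition so that only whole words match):

// Return String instead of Any and actually accumulate the replacements,
// so Spark can resolve Encoder[String] for the resulting Dataset.
def cleanText(x: String, stopWordsList: List[String]): String =
  stopWordsList.foldLeft(x)((s, w) => s.replaceAll("\\b" + w + "\\b", ""))

val df4 = df3.map(x => cleanText(x.mkString, stopWordsList)) // Dataset[String]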
Answer 0 (score: 1)
Have a look at this approach, which mixes the DataFrame and RDD APIs:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField}

val df = Seq(("It is evening"), ("Good morning"), ("Hello everyone"), ("What is your name"), ("I'll see you tomorrow")).toDF("data")
val removeList = List("Hello", "evening", "because", "is")

// For each row, fold over removeList and delete every whole-word match
// (the \b word boundaries keep e.g. "is" inside "island" intact), then
// emit a Row holding the original text plus the cleaned text.
val rdd2 = df.rdd.map { x =>
  val p = x.getAs[String]("data")
  val k = removeList.foldLeft(p)((acc, t) => acc.replaceAll("\\b" + t + "\\b", ""))
  Row(x(0), k)
}

spark.createDataFrame(rdd2, df.schema.add(StructField("new1", StringType))).show(false)
Output:
+---------------------+---------------------+
|data |new1 |
+---------------------+---------------------+
|It is evening |It |
|Good morning |Good morning |
|Hello everyone | everyone |
|What is your name |What your name |
|I'll see you tomorrow|I'll see you tomorrow|
+---------------------+---------------------+
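For comparison, the same cleanup can stay entirely in the DataFrame API with a UDF, avoiding the RDD round-trip and the manual schema work. A minimal sketch under the same df and removeList (the name cleanUdf is mine):

import org.apache.spark.sql.functions.{col, udf}

// Wrap the same fold-based whole-word removal in a UDF and add it as a column.
val cleanUdf = udf { s: String =>
  removeList.foldLeft(s)((acc, w) => acc.replaceAll("\\b" + w + "\\b", ""))
}

df.withColumn("new1", cleanUdf(col("data"))).show(false)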
Answer 1 (score: 0)
This code works for me (Spark version 2.3.0, Scala version 2.11.8). First, as a Dataset[String]:
import org.apache.spark.sql.SparkSession
val data = List(
"It is evening",
"Good morning",
"Hello everyone",
"What is your name",
"I'll see you tomorrow"
)
val removeList = List("Hello", "evening", "because", "is")
val spark = SparkSession.builder.master("local[*]").appName("test").getOrCreate()
val sc = spark.sparkContext
import spark.implicits._
// Remove each term literally (plain substring replacement, no regex).
def cleanText(text: String, removeList: List[String]): String =
  removeList.foldLeft(text) {
    case (text, termToRemove) => text.replaceAllLiterally(termToRemove, "")
  }
val df1 = sc.parallelize(data).toDS // Dataset[String]
val df2 = df1.map(text => cleanText(text, removeList)) // Dataset[String]
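A quick check, assuming the session above:

df2.show(false) // prints the cleaned sentences; note the leftover double spaces where a word was removed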
Second, the same setup and the same cleanText, but going through a DataFrame and a temp SQL table:
// Creates a temp table.
sc.parallelize(data).toDF("text").createTempView("table")
val df1 = spark.sql("SELECT text FROM table") // DataFrame = [text: string]
val df2 = df1.map(row => cleanText(row.getAs[String](fieldName = "text"), removeList)).toDF("text") // DataFrame = [text: string]
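One caveat with replaceAllLiterally: it replaces raw substrings, so a stopword embedded inside a longer word (e.g. "is" inside "island") would be removed as well, and double spaces are left behind. A hedged variant using word-boundary regexes, with a hypothetical name cleanTextWholeWords (Pattern.quote guards terms that contain regex metacharacters):

import java.util.regex.Pattern

// Remove only whole-word matches, then collapse the leftover runs of spaces.
def cleanTextWholeWords(text: String, removeList: List[String]): String =
  removeList
    .foldLeft(text)((s, w) => s.replaceAll("\\b" + Pattern.quote(w) + "\\b", ""))
    .replaceAll(" +", " ")
    .trim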