Question

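I have two datasets. One is a DataFrame with a bunch of data, including one column of comments (a string). The other is a list of words.

If a comment contains a word from the word list, I want to replace that word in the comment with @@@@@ and return the full comment with the replacements applied.

Here is some sample data: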

CommentSample.txt

1   A badword small town
2   "Love the truck, though rattle is annoying."
3   Love the paint!
4   
5   "Like that you added the ""oh badword2"" handle to passenger side."
6   "badword you. specific enough for you, badword3?"   
7   This car is a piece if badword2

ProfanitySample.txt

badword
badword2
badword3

Here is my code so far:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

import sqlContext.implicits._

case class Response(UniqueID: Int, Comment: String)

val response = sc.textFile("file:/data/CommentSample.txt").map(_.split("\t")).filter(_.size == 2).map(r => Response(r(0).trim.toInt, r(1).trim.toString, r(10).trim.toInt)).toDF()

var profanity = sc.textFile("file:/data/ProfanitySample.txt").map(x => (x.toLowerCase())).toArray();

    def replaceProfanity(s: String): String = {
        val l = s.toLowerCase()
        val r = "@@@@@"
        if(profanity.contains(s))
            r
        else
            s
      }

    def processComment(s: String): String = {
        val commentWords = sc.parallelize(s.split(' '))
        commentWords.foreach(replaceProfanity)
        commentWords.collect().mkString(" ")
      }

    response.select(processComment("Comment")).show(100)

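It compiles and runs, but the words are not replaced. I don't know how to debug in Scala. I'm completely new to this, and this is my first project!

Any pointers are greatly appreciated. -M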

Answer 1

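First, I don't think the use case you describe here benefits much from DataFrames. It is simpler to implement with plain RDDs (DataFrames are very handy when the transformation is easy to describe in SQL, which isn't the case here).

So, here is a possible implementation using RDDs. It assumes the profanity list is not too large (i.e. up to a few thousand entries), so we can collect it into non-distributed memory. If that isn't the case, a different approach (involving a join) might be needed.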

import org.apache.spark.rdd.RDD // needed for the RDD[Response] type annotation below

case class Response(UniqueID: Int, Comment: String)

val mask = "@@@@@"

val responses: RDD[Response] = sc.textFile("file:/data/CommentSample.txt").map(_.split("\t")).filter(_.size == 2).map(r => Response(r(0).trim.toInt, r(1).trim))
val profanities: Array[String] = sc.textFile("file:/data/ProfanitySample.txt").collect()

val result = responses.map(r => {
  // using foldLeft here means we'll replace profanities one by one, 
  // with the result of each replace as the input of the next,
  // starting with the original comment 
  profanities.foldLeft(r.Comment)({ 
     case (updatedComment, profanity) => updatedComment.replaceAll(s"(?i)\\b$profanity\\b", mask) 
  })
})

result.take(10).foreach(println) // just printing some examples...

Note that both case-insensitivity and the "whole words only" restriction are implemented in the regular expression: "(?i)\\bSomeWord\\b".
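To see how the regular expression behaves without starting Spark, the same foldLeft idea can be tried on plain Scala strings. This is only a minimal sketch: the object name `ProfanityMaskDemo` and the helper `maskProfanity` are made up for illustration, and wrapping each word in `Pattern.quote` is an extra precaution (not part of the code above) in case a list entry contains regex metacharacters.

```scala
import java.util.regex.{Matcher, Pattern}

// Hypothetical helper object, for illustration only.
object ProfanityMaskDemo {
  val mask = "@@@@@"

  // Replace every whole-word, case-insensitive occurrence of each listed
  // word with the mask, feeding the result of one replacement into the next.
  def maskProfanity(comment: String, profanities: Seq[String]): String =
    profanities.foldLeft(comment) { (updated, word) =>
      // (?i) = case-insensitive, \b = word boundary;
      // Pattern.quote treats the word as a literal (an added assumption).
      val pattern = "(?i)\\b" + Pattern.quote(word) + "\\b"
      updated.replaceAll(pattern, Matcher.quoteReplacement(mask))
    }
}
```

For example, calling `ProfanityMaskDemo.maskProfanity("This car is a piece if badword2", Seq("badword", "badword2", "badword3"))` masks only the whole word `badword2`. The shorter entry `badword` does not match inside it, because the trailing `2` is a word character and so there is no word boundary after `badword`.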
