I want to replace every bigram whose frequency count is greater than a threshold with the pattern (word1.concat("-").concat(word2)), and I have tried:
import org.apache.spark.{SparkConf, SparkContext}

object replace {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("replace")
    val sc = new SparkContext(conf)

    val rdd = sc.textFile("data/ddd.txt")
    val threshold = 2

    val searchBigram = rdd.map {
      _.split('.').map { substrings =>
        // Trim substrings and then tokenize on spaces
        substrings.trim.split(' ')
          // Remove non-alphanumeric characters and convert to lowercase
          .map(_.replaceAll("""\W""", "").toLowerCase())
          .sliding(2)
      }
        .flatMap(identity)        // flatten the per-sentence windows
        .map(_.mkString(" "))     // bigram as "word1 word2"
        .groupBy(identity)
        .mapValues(_.size)        // per-line bigram counts
    }
      .flatMap(identity)          // (bigram, count) pairs across all lines
      .reduceByKey(_ + _)
      .collect
      .sortBy(-_._2)
      .takeWhile(_._2 >= threshold)
      .map(x => x._1.split(' '))
      .map(x => (x(0), x(1)))
      .toVector

    val sample1 = sc.textFile("data/ddd.txt")
    val sample2 = sample1.map(s => s.split(" ")   // split on space
      .sliding(2)                                 // take continuous pairs
      .map { case Array(a, b) => (a, b) }
      .map(elem =>
        if (searchBigram.contains(elem))
          (elem._1.concat("-").concat(elem._2), " ")
        else elem)
      .map { case (e1, e2) => e1 }                // keep only the first word of each pair
      .mkString(" "))
    sample2.foreach(println)
  }
}
But this code deletes the last word of each document, and it throws some errors when I run it on a file containing a large number of documents.
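A minimal illustration (plain Scala, no Spark needed) of where the last word goes: sliding(2) pairs every word with its successor, so keeping only the first element of each pair never emits the final word:

val words = "cables burned revivification place".split(" ")
words.sliding(2)
  .map { case Array(a, b) => (a, b) }  // (cables,burned), (burned,revivification), (revivification,place)
  .map { case (e1, _) => e1 }          // first element of each pair only
  .mkString(" ")                       // "cables burned revivification" -- "place" is dropped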
Suppose my input file contains the following documents:
surprise heard thump opened door small seedy man clasping package wrapped.
upgrading system found review spring two thousand issue moody audio mortgage backed.
omg left gotta wrap review order asap . understand issue moody hand delivered dali lama
speak hands wear earplugs lives . listen maintain link long .
buffered lightning two thousand volts cables burned revivification place .
cables volts cables finally able hear auditory issue moody gem long rumored music .
My desired output is:
surprise heard thump opened door small-man clasping package wrapped.
upgrading system found review spring two-thousand issue-moody audio mortgage backed.
omg left gotta wrap review order asap . understand issue-moody hand delivered dali lama
speak hands wear earplugs lives . listen maintain link long small-man .
buffered lightning two-thousand volts-cables burned revivification place .
cables volts-cables finally able hear auditory issue-moody gem long rumored music .
Can anyone help me?
Answer 0 (score: 3)
Spoonfeeding:
case class Bigram(first: String, second: String) {
  def mkReplacement(s: String) = s.replaceAll(first + " " + second, first + "-" + second)
}

val data = List(
  "surprise heard thump opened door small seedy man clasping package wrapped",
  "upgrading system found review spring two thousand issue moody audio mortgage backed",
  "omg left gotta wrap review order asap",
  "understand issue moody hand delivered dali lama",
  "speak hands wear earplugs lives . listen maintain link long",
  "buffered lightning two thousand volts cables burned revivification place",
  "cables volts cables finally able hear auditory issue moody gem long rumored music")

def stringToBigrams(s: String) = {
  val words = s.split(" ")
  if (words.size >= 2) {
    words.sliding(2).map(a => Bigram(a(0), a(1)))
  } else
    Iterator[Bigram]()
}

val bigrams = data.flatMap { stringToBigrams }

// use reduceByKey rather than groupBy for Spark
val bigramCounts = bigrams.groupBy(identity).mapValues(_.size)

val threshold = 2
val topBigrams = bigramCounts.collect { case (b, c) if c >= threshold => b }

val replaced = data.map(r =>
  topBigrams.foldLeft(r)((r, b) => b.mkReplacement(r)))
replaced.foreach(println)
//> surprise heard thump opened door small seedy man clasping package wrapped
//| upgrading system found review spring two-thousand issue-moody audio mortgage backed
//| omg left gotta wrap review order asap
//| understand issue-moody hand delivered dali lama
//| speak hands wear earplugs lives . listen maintain link long
//| buffered lightning two-thousand volts-cables burned revivification place
//| cables volts-cables finally able hear auditory issue-moody gem long rumored music
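Following the reduceByKey comment above, here is a minimal sketch of how the same counting step might look on Spark RDDs, reusing Bigram, stringToBigrams, and threshold from the snippet; it assumes sc is an existing SparkContext, and the file path is purely illustrative:

// Sketch only: same pipeline, but the counting runs distributed over an RDD.
val lines = sc.textFile("data/ddd.txt")        // illustrative path
val topBigramsRdd = lines
  .flatMap(stringToBigrams)                    // one Bigram per adjacent word pair
  .map(b => (b, 1))
  .reduceByKey(_ + _)                          // distributed count, no groupBy shuffle of values
  .filter { case (_, count) => count >= threshold }
  .keys
  .collect()                                   // assumed small enough to hold on the driver
val replacedRdd = lines.map(r => topBigramsRdd.foldLeft(r)((acc, b) => b.mkReplacement(acc)))
replacedRdd.foreach(println)

Collecting the frequent bigrams to the driver keeps the final replacement a simple map over the lines; that only works while the set of frequent bigrams stays small.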
Answer 1 (score: 0)
from operator import add  # needed by reduceByKey below

def getNgrams(sentence):
    out = []
    sen = sentence.split(" ")
    for k in range(len(sen) - 1):
        out.append((sen[k], sen[k + 1]))
    return out

if __name__ == '__main__':
    try:
        # LocalSparkContext appears to be the answerer's own helper module
        lsc = LocalSparkContext.LocalSparkContext("Recommendation", "spark://BigData:7077")
        sc = lsc.getBaseContext()
        ssc = lsc.getSQLContext()
        inFile = "bigramstxt.txt"
        sen = sc.textFile(inFile, 1)
        v = 1
        brv = sc.broadcast(v)  # broadcast the threshold
        # count bigrams and keep those whose count exceeds the broadcast threshold
        wordgroups = sen.flatMap(getNgrams) \
                        .map(lambda t: (t, 1)) \
                        .reduceByKey(add) \
                        .filter(lambda t: t[1] > brv.value)
        bigrams = wordgroups.collect()
        sc.stop()
        # do the replacement on the driver, over the raw text
        inp = open(inFile, 'r').read()
        print inp
        for b in bigrams:
            print b
            inp = inp.replace(" ".join(b[0]), "-".join(b[0]))
        print inp
    except:
        raise
        sc.stop()  # unreachable after raise, kept from the original