Question

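I'm working on a language model and want to count the pairs of two consecutive words. I found an example of this problem in Scala using the sliding function, but I haven't managed to find an equivalent in PySpark.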

data.sliding(2).map(lambda pair: (pair, 1)).reduceByKey(lambda x, y: x + y)

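I guess it should be something like that. A workaround might be to write a function that looks up the next word in the array, but I guess there should be a built-in solution.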

Answer 1

Maybe this will help. You can find other ways of splitting here: Is there a way to split a string by every nth separator in Python?

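# Python 3: the built-in zip replaces itertools.izip from Python 2.
# The two sentences are joined without a space, so "words.I" is a single token.
text = ("I'm working on language model and want to count the number pairs of two consequent words."
        "I found an examples of such problem on language model and want to count the number pairs")

# zip(i, i) takes two items at a time from the same iterator,
# which gives non-overlapping pairs: (w1, w2), (w3, w4), ...
i = iter(text.split())

# sc is assumed to be an existing SparkContext
rdd = sc.parallelize([" ".join(x) for x in zip(i, i)])

print(rdd.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y).collect())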

[('found an', 1), ('count the', 2), ('want to', 2), ('examples of', 1), ('model and', 2), ('on language', 2), ('number pairs', 2), ("I'm working", 1), ('consequent words.I', 1), ('such problem', 1), ('of two', 1)]
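
Note that the pairs above do not overlap: each word lands in exactly one pair. If every adjacent pair is wanted (closer to what Scala's sliding(2) gives), one possible approach is to zip the word list with itself shifted by one. The following is only a sketch in plain Python; the helper names sliding_pairs and count_pairs are made up for illustration and are not part of PySpark.

```python
from collections import Counter


def sliding_pairs(words):
    """Return every pair of adjacent words: (w1, w2), (w2, w3), ..."""
    # zip stops at the shorter input, so the last word gets no partner
    return list(zip(words, words[1:]))


def count_pairs(words):
    """Count each adjacent pair (a local stand-in for map + reduceByKey)."""
    return Counter(sliding_pairs(words))


words = "on language model and on language".split()
pair_counts = count_pairs(words)
print(pair_counts[("on", "language")])  # 2
```

On an RDD of text lines, the same helper could probably be plugged in as rdd.flatMap(lambda line: sliding_pairs(line.split())).map(lambda p: (p, 1)).reduceByKey(lambda x, y: x + y). Under that assumption, pairs that span two lines would not be counted.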

Pairs of two consequent words pyspark
