Question

I am trying to count bigrams using Spark with the Python API.

My output is strange. It consists of many lines like this:

 <generator object <genexpr> at 0x11aab40>

Here is my code:

from pyspark import SparkConf, SparkContext
import string

conf = SparkConf().setMaster('local').setAppName('BigramCount')
sc = SparkContext(conf = conf)

RDDvar = sc.textFile("file:///home/cloudera/Desktop/smallTest.txt")

sentences = RDDvar.flatMap(lambda line: line.split("."))
words = sentences.flatMap(lambda line: line.split(" "))
bigrams = words.flatMap(lambda x:[((x[i],x[i+1]) for i in range(0,len(x)-1))])

result = bigrams.map(lambda bigram: bigram, 1)
aggreg1 = result.reduceByKey(lambda a, b: a+b)

result.saveAsTextFile("file:///home/cloudera/bigram_out")

What is going wrong?

Answer 1

The function you pass to flatMap:

lambda x:[((x[i],x[i+1]) for i in range(0,len(x)-1))]

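returns a list with a single element, and that element is the enclosed generator expression. flatMap flattens the outer list, so what remains is an RDD of generators. Just drop the outer list: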

words.flatMap(lambda x:((x[i],x[i+1]) for i in range(0,len(x)-1)))
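
The effect can be reproduced without Spark. The sketch below is only an illustration: flat_map is a hypothetical plain-Python stand-in for RDD.flatMap (apply the function, then flatten one level), and the two-element words list is made-up input.

```python
from itertools import chain

def flat_map(func, items):
    # Hypothetical stand-in for RDD.flatMap: apply func, then flatten one level.
    return list(chain.from_iterable(func(item) for item in items))

words = ["ab", "cde"]  # made-up input

# Original lambda: returns a list whose only element is a generator object.
wrapped = flat_map(lambda x: [((x[i], x[i + 1]) for i in range(0, len(x) - 1))], words)

# Fixed lambda: the generator itself is flattened into tuples.
unwrapped = flat_map(lambda x: ((x[i], x[i + 1]) for i in range(0, len(x) - 1)), words)

print(wrapped)    # two generator objects, one per input element
print(unwrapped)  # [('a', 'b'), ('c', 'd'), ('d', 'e')]
```

Flattening removes only the outer list, which is why each saved line is the repr of a generator rather than a tuple.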

Or, better still, use zip:

words.flatMap(lambda xs: zip(xs, xs[1:]))
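
One thing to watch: in the question's pipeline each element of words is a single word, so indexing or zipping it pairs up characters, not words. To get word bigrams, the zip has to run over the token list of each sentence. A minimal plain-Python sketch of that idea follows; word_bigrams and the sample text are made up for illustration.

```python
from collections import Counter

def word_bigrams(sentence):
    # split() with no argument also drops empty tokens from repeated spaces.
    tokens = sentence.split()
    return list(zip(tokens, tokens[1:]))

text = "the cat sat. the cat ran"  # made-up sample
counts = Counter(
    bigram
    for sentence in text.split(".")
    for bigram in word_bigrams(sentence)
)

print(counts[("the", "cat")])       # 2
print(list(zip("cat", "cat"[1:])))  # [('c', 'a'), ('a', 't')], character pairs
```

In Spark the same function could presumably be passed as sentences.flatMap(word_bigrams), skipping the intermediate words RDD.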

Answer 2

Here is my sample code.

from __future__ import print_function

import sys
from operator import add
from pyspark import SparkContext

def split(line):
    words = line.split(" ")
    return [(words[i], words[i+1]) for i in range(len(words)-1)]

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: bigram <file>", file=sys.stderr)
        exit(-1)
    sc = SparkContext()
    lines = sc.textFile(sys.argv[1], 1)
    sentences = lines.glom() \
              .map(lambda x: " ".join(x)) \
              .flatMap(lambda x: x.split("."))

    bi_counts = sentences.flatMap(lambda line: split(line))\
        .map(lambda x: (x, 1))\
        .reduceByKey(add)

    bi_counts.saveAsTextFile("bigram_count.out")
    sc.stop()
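
To see what the glom/join step and reduceByKey(add) do, here is a rough plain-Python walk-through of the same logic. The partition contents are made up, and the dictionary loop is only a stand-in for reduceByKey, not Spark's implementation.

```python
from operator import add

def split(line):
    words = line.split(" ")
    return [(words[i], words[i + 1]) for i in range(len(words) - 1)]

# Hypothetical partition: one sentence broken across two input lines.
partition = ["a b", "c. d e"]

joined = " ".join(partition)   # what map(lambda x: " ".join(x)) yields for this partition
sentences = joined.split(".")  # ['a b c', ' d e']

pairs = [(bigram, 1) for s in sentences for bigram in split(s)]

# Plain-Python stand-in for reduceByKey(add)
counts = {}
for key, value in pairs:
    counts[key] = add(counts[key], value) if key in counts else value

print(counts)
```

The bigram ('b', 'c') is found only because the two lines were joined first. The leading space after the period also yields a ('', 'd') pair; using line.split() with no argument would avoid that empty token. Since glom works per partition, a sentence that crosses a partition boundary would presumably still be cut.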

HTH

Answer 3

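N-gram generation is already implemented in the pyspark.ml library as NGram; it is easy to use and efficient.

An example can be found in the pyspark.ml.feature documentation. NGram is part of that feature subpackage; here is how to use it: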

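from pyspark.ml.feature import NGram
from pyspark.sql import Row, SparkSession

# Reuse the active session or create one; in the pyspark shell `spark` already exists
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([Row(tokens='The brown fox jumped over the white fence'.split())])
ngram = NGram(n=2, inputCol="tokens", outputCol="bigrams")
df = ngram.transform(df)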

The resulting DataFrame (df) will contain a new column named bigrams, of type Array(String), holding the bigrams generated from the input column tokens.
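
For a feel of what ends up in the bigrams column, the sketch below imitates the output format in plain Python: each n-gram is one string of n consecutive tokens joined by a space. The ngrams helper is hypothetical and is not how pyspark.ml implements the transformer.

```python
def ngrams(tokens, n=2):
    # Each n-gram is a single space-joined string, matching the Array(String) column.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "The brown fox jumped over the white fence".split()
bigrams = ngrams(tokens)

print(bigrams[:3])    # ['The brown', 'brown fox', 'fox jumped']
print(len(bigrams))   # 7
```

Because the column holds arrays, counting bigrams across rows would still need a further step, for example exploding the array with pyspark.sql.functions.explode and then grouping and counting.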

Answer 4

Python appears to be storing the generator expression itself as a value in this line:

bigrams = words.flatMap(lambda x:[((x[i],x[i+1]) for i in range(0,len(x)-1))])

You probably just need to replace it with the following:

bigrams = words.flatMap( lambda x:list((x[i],x[i+1]) for i in range(0,len(x)-1)) )
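
The difference between the two forms shows up outside Spark as well. A small sketch with a made-up input string:

```python
x = "abc"  # made-up input

# Square brackets around a parenthesised generator: a list holding one generator.
wrapped = [((x[i], x[i + 1]) for i in range(0, len(x) - 1))]

# list(...) consumes the generator and returns the tuples.
materialized = list((x[i], x[i + 1]) for i in range(0, len(x) - 1))

# A list comprehension gives the same result without an explicit generator.
comprehension = [(x[i], x[i + 1]) for i in range(0, len(x) - 1)]

print(len(wrapped))   # 1
print(materialized)   # [('a', 'b'), ('b', 'c')]
```

So the extra pair of parentheses inside the brackets is what turns an intended list comprehension into a one-element list.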

See here for a more in-depth explanation.

Bigram counting with Spark (Python) produces strange output

4 answers: