I am new to Python and PySpark. I am using code that runs on PySpark to build a TF-IDF model. However, when I call the ngrams method of the well-known Pattern library, an UnboundLocalError is raised.
Here is the (id, list[text]) data layout, extracted from text via
text.map(lambda x: (x["_id"], (x["span"], x["text"]))).groupByKey().map(lambda x: (x[0], list(x[1]))):
[(u'en.wikipedia.org/wiki/Woodville_South,_South_Australia',
[u'Campbell was born in Myrtle Bank.']),
(u'en.wikipedia.org/wiki/Picket_(military)',
[u'The film dealt with the story .',
u"Members of the Union force."]),
(u'en.wikipedia.org/wiki/320th_Troop_Carrier_Squadron',
[u" The 1st Air Transport."])]
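The same grouping can be sketched in plain Python (no Spark needed) to make the layout explicit. The records below are shaped like the RDD elements above but are made up for illustration:

```python
from collections import defaultdict

# Hypothetical records shaped like the RDD elements in the question.
records = [
    {"_id": u"en.wikipedia.org/wiki/Picket_(military)",
     "span": (0, 8), "text": u"The film dealt with the story ."},
    {"_id": u"en.wikipedia.org/wiki/Picket_(military)",
     "span": (3, 10), "text": u"Members of the Union force."},
]

# Plain-Python equivalent of .map(lambda x: (x["_id"], x["text"])).groupByKey()
grouped = defaultdict(list)
for r in records:
    grouped[r["_id"]].append(r["text"])

# [(id, [text, text, ...]), ...] -- note each value is a LIST of strings
pairs = list(grouped.items())
```

The key point is that each value is a list of strings, not a single string, which matters for the ngrams call later.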
And this is the format of the idfs:
Out[23]:
[{'_id': u'1,800', 'idf': 7.245417283738939},
{'_id': u'Poetry', 'idf': 5.399590593240608},
{'_id': u'Bloodworth', 'idf': 7.938564464298884},
{'_id': u'Mullally', 'idf': 7.938564464298884}]
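For context, idf values like these are typically computed as log(N / df). The corpus size and document frequencies below are made-up numbers for illustration, not the ones behind the output above:

```python
import math

N = 1000  # hypothetical number of documents in the corpus
document_frequency = {u'Poetry': 4, u'Bloodworth': 1}  # made-up counts

# One {'_id': term, 'idf': value} dict per term, matching the format above.
idfs = [{'_id': term, 'idf': math.log(float(N) / df)}
        for term, df in sorted(document_frequency.items())]
```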
Here is the part of the code I am using:
# the .mapValues(...) line below is the ngrams method call
corpus = text\
    .mapValues(lambda v: ngrams(v, self.max_ngram))\
    .flatMap(lambda (target, tokens): (((target, t), 1) for t in tokens))\
    .reduceByKey(add)\
    .map(lambda ((target, token), count): (token, (target, count)))\
Here is the Pattern library method, from text.py:
def ngrams(string, n=3, punctuation=PUNCTUATION, continuous=False):
    """ Returns a list of n-grams (tuples of n successive words) from the given string.
        Alternatively, you can supply a Text or Sentence object.
        With continuous=False, n-grams will not run over sentence markers (i.e., .!?).
        Punctuation marks are stripped from words.
    """
    def strip_punctuation(s, punctuation=set(punctuation)):
        return [w for w in s if (isinstance(w, Word) and w.string or w) not in punctuation]
    if n <= 0:
        return []
    if isinstance(string, basestring):
        s = [strip_punctuation(s.split(" ")) for s in tokenize(string)]
    if isinstance(string, Sentence):
        s = [strip_punctuation(string)]
    if isinstance(string, Text):
        s = [strip_punctuation(s) for s in string]
    if continuous:
        s = [sum(s, [])]
    g = []
    for s in s:  # ERROR triggered here
        #s = [None] + s + [None]
        g.extend([tuple(s[i:i+n]) for i in range(len(s)-n+1)])
    return g
Here is the traceback of the error message:
python2.7/site-packages/sift/util.py", line 8, in ngrams
for n in en.ngrams(text, n=i+1, **pattern_args):
python2.7/site-packages/pattern/text/__init__.py", line 83, in ngrams
for s in s:
UnboundLocalError: local variable 's' referenced before assignment
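The error can be reproduced without Spark or Pattern. In a function whose local variable is only assigned inside isinstance branches, passing a value that matches none of the branches leaves the variable unbound when the loop reads it. The sketch below mimics that structure (it is a simplified stand-in, not Pattern's actual code):

```python
def ngrams_like(value, n=3):
    # Mirrors the shape of Pattern's ngrams(): 's' is only assigned
    # when 'value' is a string, so any other type (e.g. the list that
    # groupByKey produces) reaches the loop with 's' never bound.
    if isinstance(value, str):
        s = [value.split(" ")]
    g = []
    for s in s:  # UnboundLocalError if 'value' was not a string
        g.extend([tuple(s[i:i + n]) for i in range(len(s) - n + 1)])
    return g

# A string works; a list of strings does not.
try:
    ngrams_like([u"a list,", u"not a string"])
    error = None
except UnboundLocalError as e:
    error = str(e)
```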
I understand what the error means. I tried editing the library method, but the error persists, so perhaps I did not fix it properly, or the problem lies elsewhere. How can I resolve this error?
I am using Python 2.7 and PySpark 2.3.0.
Any help or guidance is greatly appreciated.
Many thanks,
Answer 0 (score: 0)
If I remember your previous question correctly, "v" is the result of groupByKey, so it is a list of strings. The simplest fix is to turn "v" into a single string:
from pattern.en import ngrams

rdd = sc.parallelize([{'_id': u'en.wikipedia.org/wiki/Cerambycidae',
                       'source': 'en.wikipedia.org/wiki/Plinthocoelium_virens',
                       'span': (61, 73),
                       'text': u'"Plinthocoelium virens" is a species of beetle in the family Cerambycidae.'},
                      {'_id': u'en.wikipedia.org/wiki/Dru_Drury',
                       'source': 'en.wikipedia.org/wiki/Plinthocoelium_virens',
                       'span': (20, 29),
                       'text': u'It was described by Dru Drury in 1770.'},
                      {'_id': u'en.wikipedia.org/wiki/Dru_Drury',
                       'source': 'en.wikipedia.org/wiki/Plinthocoelium_virens2',
                       'span': (20, 29, 2),
                       'text': u'It was described by Dru Drury in 1770.2'}])

print rdd.map(lambda x: (x["_id"], x["text"])).groupByKey()\
         .map(lambda x: (x[0], list(x[1])))\
         .mapValues(lambda v: ngrams(" ".join(v), 5))\
         .collect()
[(u'en.wikipedia.org/wiki/Dru_Drury',
  [(u'It', u'was', u'described', u'by', u'Dru'),
   (u'was', u'described', u'by', u'Dru', u'Drury'),
   (u'described', u'by', u'Dru', u'Drury', u'in'),
   (u'by', u'Dru', u'Drury', u'in', u'1770'),
   (u'It', u'was', u'described', u'by', u'Dru'),
   (u'was', u'described', u'by', u'Dru', u'Drury'),
   (u'described', u'by', u'Dru', u'Drury', u'in'),
   (u'by', u'Dru', u'Drury', u'in', u'1770.2')]),
 (u'en.wikipedia.org/wiki/Cerambycidae',
  [(u'Plinthocoelium', u'virens', u'is', u'a', u'species'),
   (u'virens', u'is', u'a', u'species', u'of'),
   (u'is', u'a', u'species', u'of', u'beetle'),
   (u'a', u'species', u'of', u'beetle', u'in'),
   (u'species', u'of', u'beetle', u'in', u'the'),
   (u'of', u'beetle', u'in', u'the', u'family'),
   (u'beetle', u'in', u'the', u'family', u'Cerambycidae')])]
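If Pattern is not at hand, the effect of joining the grouped texts before taking n-grams can be sketched in plain Python. This simplified stand-in ignores Pattern's sentence-boundary and punctuation handling, so its output differs slightly from the real library:

```python
def simple_ngrams(string, n=3):
    # Simplified stand-in for pattern.en.ngrams: operates on ONE string.
    tokens = string.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# A grouped value as produced by groupByKey: a list of strings.
grouped_value = [u"It was described by Dru Drury in 1770.",
                 u"It was described by Dru Drury in 1770.2"]

# The fix from the answer: join the list into one string first.
result = simple_ngrams(u" ".join(grouped_value), 5)
```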
Answer 1 (score: 0)
In if continuous:, you reference s. But if none of the preceding conditions held, s does not exist.
A simple fix is to give s an initial value before the ifs, e.g. [].
You could also rename this variable to avoid the clash with the loop variable.
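A sketch of that fix applied to the same structure: initialize the variable to [] before the type checks and rename the loop variable, so unsupported input yields an empty result instead of a crash. (Pattern's real internals, such as the Sentence and Text branches and punctuation stripping, are omitted here.)

```python
def ngrams_fixed(value, n=3):
    sentences = []  # initial value, so the loop below is always safe
    if isinstance(value, str):
        sentences = [value.split(" ")]
    g = []
    for sentence in sentences:  # renamed to avoid shadowing the iterable
        g.extend([tuple(sentence[i:i + n])
                  for i in range(len(sentence) - n + 1)])
    return g
```

With this change, passing a list no longer raises UnboundLocalError; it simply returns [] because none of the type checks assigned sentences.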