Question

I currently build a list like this:

stopfile = os.path.join(baseDir, inputPath, STOPWORDS_PATH)
stopwords = set(sc.textFile(stopfile).collect())
print('These are the stopwords: %s' % stopwords)

def tokenize(string):
    """ An implementation of input string tokenization that excludes stopwords
    Args:
        string (str): input string
    Returns:
        list: a list of tokens without stopwords
    """
    res = list()
    for word in simpleTokenize(string):
        if word not in stopwords:
            res.append(word)
    return res

simpleTokenize is just a basic split function on a string; it returns a list of strings.

Answer 1

This is fine. If you want a more "Pythonic" way (one line instead of four), you can use a list comprehension:

res = [word for word in simpleTokenize(string) if word not in stopwords]
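To check that the comprehension is a drop-in replacement for the loop, here is a minimal self-contained sketch. The simpleTokenize below is only a stand-in (it assumes a lowercase, split-on-non-word-characters tokenizer), and the stopwords set is made up for illustration; neither comes from the question's actual data.

```python
import re

# Hypothetical stopword set, standing in for the one loaded from the file.
stopwords = {'the', 'a', 'of', 'is'}

def simpleTokenize(string):
    """Assumed behaviour: lowercase, split on non-word characters, drop empties."""
    return [t for t in re.split(r'\W+', string.lower()) if t]

def tokenize_loop(string):
    # The original four-line loop from the question.
    res = list()
    for word in simpleTokenize(string):
        if word not in stopwords:
            res.append(word)
    return res

def tokenize(string):
    # The one-line list comprehension.
    return [word for word in simpleTokenize(string) if word not in stopwords]

sample = 'The quick brown fox is a friend of the dog'
assert tokenize(sample) == tokenize_loop(sample)
print(tokenize(sample))  # ['quick', 'brown', 'fox', 'friend', 'dog']
```

Both versions do one set lookup per token, so the comprehension mainly buys readability and avoids the repeated res.append call.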

Answer 2

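You are already using a set, which is the biggest potential speedup (from the question title, I expected your code to be doing list.__contains__ tests). The only remaining thing I would suggest is turning your function into a generator, so you don't need to build the res list: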

def tokenize(string):
    for word in simpleTokenize(string):
        if word not in stopwords:
            yield word
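A generator changes how the caller consumes the result, so a short usage sketch may help. As before, simpleTokenize here is an assumed stand-in (a plain whitespace split) and the stopwords set is made up for the example.

```python
# Hypothetical stopword set for illustration only.
stopwords = {'the', 'a', 'of'}

def simpleTokenize(string):
    # Assumed stand-in: lowercase the string and split on whitespace.
    return string.lower().split()

def tokenize(string):
    # Yields tokens one at a time instead of building a list first.
    for word in simpleTokenize(string):
        if word not in stopwords:
            yield word

tokens = tokenize('The name of the rose')
first = next(tokens)   # pulls just the first non-stopword token
rest = list(tokens)    # drains whatever is left
print(first, rest)

# A generator can only be iterated once; call list() when a real list is needed.
all_tokens = list(tokenize('The name of the rose'))
```

The generator pays off when the caller streams over the tokens once. If the caller needs len(), indexing, or several passes, wrapping the call in list() brings back the list at the same cost as the original loop.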

Answer 3

You can use the built-in filter function:

stopfile = os.path.join(baseDir, inputPath, STOPWORDS_PATH)
stopwords = set(sc.textFile(stopfile).collect())
print('These are the stopwords: %s' % stopwords)

def tokenize(string):
    """ An implementation of input string tokenization that excludes stopwords
    Args:
        string (str): input string
    Returns:
        list: a list of tokens without stopwords
    """
    # list() keeps the return type a list on Python 3, where filter() is lazy
    return list(filter(lambda x: x not in stopwords, simpleTokenize(string)))
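One caveat worth illustrating: on Python 2, filter() returns a list, but on Python 3 it returns a lazy iterator, so the docstring's promise of a list only holds there if the result is wrapped in list(). The sketch below uses an assumed whitespace-splitting simpleTokenize and a made-up stopwords set.

```python
# Hypothetical stopword set for illustration only.
stopwords = {'the', 'a', 'of'}

def simpleTokenize(string):
    # Assumed stand-in: lowercase the string and split on whitespace.
    return string.lower().split()

def tokenize_lazy(string):
    # On Python 3 this returns a filter object, not a list.
    return filter(lambda x: x not in stopwords, simpleTokenize(string))

def tokenize(string):
    # Wrapping in list() gives a real list on both Python 2 and Python 3.
    return list(filter(lambda x: x not in stopwords, simpleTokenize(string)))

result = tokenize('A tale of the two cities')
print(result)  # ['tale', 'two', 'cities']

# The lazy variant yields the same tokens once it is materialized.
assert list(tokenize_lazy('A tale of the two cities')) == result
```

The filter version is equivalent to the list comprehension; which one to use is largely a matter of taste.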

What is the most efficient way to filter values in a list based on values in another list
