Question

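Given these three list comprehensions, is there a more efficient way to do this than three separate passes over the list? I believe a for loop may be bad form in this case, but if I am iterating over a large number of rows in rowsaslist, I feel that what I have below is not that efficient.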

cachedStopWords = stopwords.words('english')

rowsaslist = [x.lower() for x in rowsaslist]
rowsaslist = [''.join(c for c in s if c not in string.punctuation) for s in rowsaslist]
rowsaslist = [' '.join([word for word in p.split() if word not in cachedStopWords]) for p in rowsaslist]

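Would it be more efficient to combine all of these into a single comprehension? From a readability standpoint, I suspect it would end up as a jumbled mess of code.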

Answer 1

Instead of iterating over the same list three times, you can simply define two functions and use them in a single list comprehension:

import string
from nltk.corpus import stopwords

cachedStopWords = stopwords.words('english')


def remove_punctuation(text):
    # Lowercase the text and drop every punctuation character.
    return ''.join(c for c in text.lower() if c not in string.punctuation)

def remove_stop_words(text):
    # Keep only the words that are not stop words.
    return ' '.join([word for word in text.split() if word not in cachedStopWords])

rowsaslist = [remove_stop_words(remove_punctuation(text)) for text in rowsaslist]

I have never used stopwords. If it returns a list, you would do better to convert it to a set first, which speeds up the word not in cachedStopWords test.
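A minimal sketch of that set conversion, assuming a short hypothetical stop-word list in place of stopwords.words('english') so that it runs without NLTK data. The names stop_word_list, stop_word_set and rows are made up for illustration:

```python
# Hypothetical stand-in for stopwords.words('english'); the real list is longer.
stop_word_list = ['a', 'an', 'the', 'is', 'to', 'of']

# A frozenset gives average constant-time membership tests,
# whereas "word in some_list" scans the list element by element.
stop_word_set = frozenset(stop_word_list)


def remove_stop_words(text, stop_words=stop_word_set):
    """Drop every whitespace-separated word found in stop_words."""
    return ' '.join(word for word in text.split() if word not in stop_words)


rows = ['the cat is on a mat', 'an owl flew to the barn']
cleaned = [remove_stop_words(row) for row in rows]
print(cleaned)  # ['cat on mat', 'owl flew barn']
```

The conversion happens once, outside the comprehension, so its cost is paid a single time no matter how many rows are processed.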

Finally, the NLTK package may help you process the text. See @alvas' answer.

Answer 2

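The way you are doing it now, each list is built in full before the next one is created. You can avoid this by switching from list comprehensions to generator expressions (note the use of () instead of []):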

rowsaslist = (x.lower() for x in rowsaslist)
rowsaslist = (''.join(c for c in s if c not in string.punctuation) for s in rowsaslist)
rowsaslist = (' '.join([word for word in p.split() if word not in cachedStopWords]) for p in rowsaslist)

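Instead of creating lists, this creates three generators. Each generator produces one value at a time, on demand, rather than eagerly building each full list.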
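To make the on-demand behaviour concrete, here is a small sketch that chains three generators and pulls values from the last one. The sample rows and the stop_words set are hypothetical stand-ins for rowsaslist and the NLTK stop-word list:

```python
import string

# Hypothetical sample data, standing in for the question's rowsaslist
# and for stopwords.words('english').
rows = ['The cat, the hat!', 'An owl: a barn.']
stop_words = {'the', 'a', 'an'}

# Each stage is a generator: defining it does no work on the rows yet.
lowered = (row.lower() for row in rows)
no_punct = (''.join(c for c in s if c not in string.punctuation) for s in lowered)
no_stops = (' '.join(w for w in p.split() if w not in stop_words) for p in no_punct)

# Pulling one value runs only the first row through all three stages.
first = next(no_stops)

# Consuming the rest processes the remaining rows, one at a time.
rest = list(no_stops)

print(first)  # cat hat
print(rest)   # ['owl barn']
```

One caveat worth keeping in mind: a generator can be consumed only once. If the cleaned rows are needed more than once, wrap the final stage in list(...) and keep that list.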

Answer 3

I favor a functional approach here*

It is ugly as sin, but there is really no way to make this not ugly. Comments are beneficial for these large all-in-one processing jobs.

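rowsaslist = [
    # Outer filter: drop stop words from the whitespace-split row.
    ' '.join(filter(lambda word: word not in cachedStopWords,
                    # Inner filter: drop punctuation, character by character.
                    ''.join(filter(lambda c: c not in string.punctuation,
                                   # Lowercase each character of the row.
                                   map(str.lower, row))).split()))
    for row in rowsaslist
]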

This explains everything perfectly.

*Admittedly, this may be because I have been playing around in Haskell more and more!

Answer 4

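Depending on whether you need the resulting list to keep the same order as the input, there are at least two ways to approach this problem.

First, you have two blacklists of things you seem to want to remove:

Punctuation
Stop words

You want to remove punctuation by looping over characters, whereas you want to remove stop words by looping over tokens.

The assumption is that the input is an untokenized, human-readable string.

Why can't punctuation be treated as a token? That way you can remove both punctuation and stop words by looping over tokens, i.e.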

>>> from nltk import word_tokenize
>>> from nltk.corpus import stopwords
>>> from string import punctuation
>>> blacklist = set(punctuation).union(set(stopwords.words('english')))
>>> blacklist
set([u'all', u'just', u'being', u'when', u'over', u'through', u'during', u'its', u'before', '$', u'hadn', '(', u'll', u'had', ',', u'should', u'to', u'only', u'does', u'under', u'ours', u'has', '<', '@', u'them', u'his', u'very', u'they', u'not', u'yourselves', u'now', '\\', u'nor', '`', u'd', u'did', u'shan', u'didn', u'these', u'she', u'each', u'where', '|', u'because', u'doing', u'there', u'theirs', u'some', u'we', u'him', u'up', u'are', u'further', u'ourselves', u'out', '#', "'", '+', u'weren', '/', u're', u'won', u'above', u'between', ';', '?', u't', u'be', u'hasn', u'after', u'here', u'shouldn', u'hers', '[', u'by', '_', u'both', u'about', u'couldn', u'of', u'o', u's', u'isn', '{', u'or', u'own', u'into', u'yourself', u'down', u'mightn', u'wasn', u'your', u'he', '"', u'from', u'her', '&', u'aren', '*', u'been', '.', u'few', u'too', u'wouldn', u'then', u'themselves', ':', u'was', u'until', '>', u'himself', u'on', u'with', u'but', u'mustn', u'off', u'herself', u'than', u'those', '^', u'me', u'myself', u'ma', u'this', u'whom', u'will', u'while', u'ain', u'below', u'can', u'were', u'more', u'my', '~', u'and', u've', u'do', u'is', u'in', u'am', u'it', u'doesn', u'an', u'as', u'itself', u'against', u'have', u'our', u'their', u'if', '!', u'again', '%', u'no', ')', u'that', '-', u'same', u'any', u'how', u'other', u'which', u'you', '=', u'needn', u'y', u'haven', u'who', u'what', u'most', u'such', ']', u'why', u'a', u'don', u'for', u'i', u'm', u'having', u'so', u'at', u'the', '}', u'yours', u'once'])
>>> sent = "This is a humanly readable string, that Tina Guo doesn't want to play"
>>> [word for word in word_tokenize(sent) if word not in blacklist]
['This', 'humanly', 'readable', 'string', 'Tina', 'Guo', "n't", 'want', 'play']

If you do not need to preserve the order of the input words, using set().difference may speed up your code:

>>> set(word_tokenize(sent)).difference(blacklist)
set(['humanly', 'play', 'string', 'This', 'readable', 'Guo', 'Tina', "n't", 'want'])

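Alternatively, if you do not want to tokenize the string, you can use str.translate to remove punctuation, which is certainly more efficient than looping over characters (the two-argument call shown here is the Python 2 signature):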

>>> sent
"This is a humanly readable string, that Tina Guo doesn't want to play"
>>> sent.translate(None, punctuation)
'This is a humanly readable string that Tina Guo doesnt want to play'
>>> stoplist = stopwords.words('english')
>>> [word for word in sent.translate(None, punctuation).split() if word not in stoplist]
['This', 'humanly', 'readable', 'string', 'Tina', 'Guo', 'doesnt', 'want', 'play']
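The session above is Python 2. In Python 3, str.translate takes a single translation table, which can be built with str.maketrans. The following is a sketch of the same idea under that assumption; the small stoplist set is a hypothetical stand-in for stopwords.words('english'):

```python
import string

# Translation table that deletes every punctuation character:
# the third argument of str.maketrans lists characters to map to None.
strip_punct = str.maketrans('', '', string.punctuation)

sent = "This is a humanly readable string, that Tina Guo doesn't want to play"

# Hypothetical small stop-word set, used here so the sketch runs without NLTK data.
stoplist = {'is', 'a', 'that', 'to'}

cleaned = sent.translate(strip_punct)
tokens = [word for word in cleaned.split() if word not in stoplist]

print(cleaned)
print(tokens)
```

Building the table once and reusing it for every row avoids repeating that setup inside the loop.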

Most efficient way to perform multiple list comprehensions in Python