Word list of lists in Python:

Posted: 2016-09-15 04:34:12

Tags: python word-list

I have a long list of reviews (say 50), for example:

"This was the biggest disappointment of our trip. The restaurant had received some very good reviews, so our expectations were high. The service was slow even though the restaurant was not very full. I had the house salad which could have come out of any Sizzler in the US. The keshi yena, although tasty, reminded me of barbequed pulled chicken. This restaurant is very overrated."

I want to create a list of word lists with Python that preserves the sentence tokenization.

After removing stop words, I want the result for all 50 reviews to keep the sentence tokens, and within each tokenized sentence keep the word tokens. In the end, I would like a result similar to:

list(c("disappointment", "trip"), 
     c("restaurant", "received", "good", "reviews", "expectations", "high"), 
     c("service", "slow", "even", "though", "restaurant", "full"),
     c("house", "salad", "come", "us"), 
     c("although", "tasty", "reminded", "pulled"), 
     "restaurant")  

How can I do this in Python? Is R a good option in this case? I would really appreciate your help.

3 Answers:

Answer 0 (score: 0)

I'm not sure whether you want R, but based on your requirements, I think this can also be done in a purely Pythonic way.

You basically want a list that contains, for each sentence, a list of its important words (the non-stop words).

So you could do something like:

input_reviews = """
this was the biggest disappointment of our trip. the restaurant had received some very good reviews, so our expectations were high. 
the service was slow even though the restaurant was not very full. I had the house salad which could have come out of any sizzler in the us. 
the keshi yena, although tasty reminded me of barbequed pulled chicken. this restaurant is very overrated.
"""

# load your stop words list here
stop_words_list = ['this', 'was', 'the', 'of', 'our', 'biggest', 'had', 'some', 'very', 'so', 'were', 'not']


def main():
    sentences = input_reviews.split('.')
    sentence_list = []
    for sentence in sentences:
        inner_list = []
        words_in_sentence = sentence.split(' ')
        for word in words_in_sentence:
            stripped_word = str(word).lstrip('\n')
            if stripped_word and stripped_word not in stop_words_list:
                # this is a good word
                inner_list.append(stripped_word)

        if inner_list:
            sentence_list.append(inner_list)

    print(sentence_list)



if __name__ == '__main__':
    main()

On my end, this outputs:

[['disappointment', 'trip'], ['restaurant', 'received', 'good', 'reviews,', 'expectations', 'high'], ['service', 'slow', 'even', 'though', 'restaurant', 'full'], ['I', 'house', 'salad', 'which', 'could', 'have', 'come', 'out', 'any', 'sizzler', 'in', 'us'], ['keshi', 'yena,', 'although', 'tasty', 'reminded', 'me', 'barbequed', 'pulled', 'chicken'], ['restaurant', 'is', 'overrated']]
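Note that this output keeps trailing punctuation attached to some tokens ('reviews,', 'yena,'), because the words are only split on spaces. A minimal sketch of stripping it with Python's `string.punctuation` (not part of the original answer) could look like:

```python
import string

# str.strip(string.punctuation) removes leading and trailing
# punctuation characters, leaving inner characters untouched.
word = 'reviews,'
cleaned = word.strip(string.punctuation)
print(cleaned)  # → reviews
```

Applying this to each `stripped_word` before the stop-word check would clean up those tokens.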

Answer 1 (score: 0)

Here is one approach. You may need to initialize stop_words to suit your application. I assume stop_words is lowercase: hence, lower() is applied to the original sentences for the comparison. sentences.lower().split('.') gives the sentences, and s.split() gives the list of words in each sentence.

stokens = [list(filter(lambda x: x not in stop_words, s.split())) for s in sentences.lower().split('.')]

You may wonder why we use filter and lambda. An alternative is the following, but it would give a single flat list of words and is therefore unsuitable:

stokens = [word for s in sentences.lower().split('.') for word in s.split() if word not in stop_words]

filter is a functional programming construct. In this case, it helps us process the entire list with an anonymous function, written using the lambda syntax.
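To make the one-liner concrete, here is a minimal runnable sketch; the small stop-word set and the two-sentence input are illustrative assumptions, not part of the original answer:

```python
# Illustrative stop words and input text (assumptions for the sketch).
stop_words = {'this', 'was', 'the', 'of', 'our', 'is', 'very'}
sentences = "This was the biggest disappointment of our trip. This restaurant is very overrated."

# One list of kept words per sentence, as in the answer above.
stokens = [list(filter(lambda x: x not in stop_words, s.split()))
           for s in sentences.lower().split('.')]

# The trailing period yields an empty final sentence; drop empty lists.
stokens = [s for s in stokens if s]
print(stokens)  # → [['biggest', 'disappointment', 'trip'], ['restaurant', 'overrated']]
```

The filtering of empty lists at the end is an extra step the one-liner alone does not perform.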

Answer 2 (score: 0)

If you don't want to create a stop-word list by hand, I recommend using the nltk library in Python. It also handles sentence splitting (rather than splitting on every period). An example that parses your sentences might look like this:

import nltk
stop_words = set(nltk.corpus.stopwords.words('english'))
text = "this was the biggest disappointment of our trip. the restaurant had received some very good reviews, so our expectations were high. the service was slow even though the restaurant was not very full. I had the house salad which could have come out of any sizzler in the us. the keshi yena, although tasty reminded me of barbequed pulled chicken. this restaurant is very overrated"
sentence_detector = nltk.data.load('tokenizers/punkt/english.pickle')
sentences = sentence_detector.tokenize(text.strip())
results = []
for sentence in sentences:
    tokens = nltk.word_tokenize(sentence)
    words = [t.lower() for t in tokens if t.isalnum()]
    not_stop_words = tuple(w for w in words if w not in stop_words)
    results.append(not_stop_words)
print(results)

Note, however, that this does not give exactly the output listed in your question; instead it looks like this:

[('biggest', 'disappointment', 'trip'), ('restaurant', 'received', 'good', 'reviews', 'expectations', 'high'), ('service', 'slow', 'even', 'though', 'restaurant', 'full'), ('house', 'salad', 'could', 'come', 'sizzler', 'us'), ('keshi', 'yena', 'although', 'tasty', 'reminded', 'barbequed', 'pulled', 'chicken'), ('restaurant', 'overrated')]

If the output needs to look identical, you may have to add some stop words manually.
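Extending the stop-word set is a simple set union. A sketch, where the base set is a small stand-in for `nltk.corpus.stopwords.words('english')` and the extra words are assumptions chosen to match the desired output:

```python
# Stand-in for the NLTK English stop-word set (assumption for the sketch).
base_stop_words = {'the', 'of', 'was', 'this', 'our'}
# Extra words to filter, e.g. to drop 'biggest' as in the desired output.
extra_stop_words = {'biggest', 'could', 'sizzler'}
stop_words = base_stop_words | extra_stop_words

words = ['this', 'was', 'the', 'biggest', 'disappointment', 'of', 'our', 'trip']
filtered = [w for w in words if w not in stop_words]
print(filtered)  # → ['disappointment', 'trip']
```

With the real NLTK set, the same `|` union (or `set.update`) applies unchanged.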