Question

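I parse information from a news website. Each news item is a dictionary stored in the translated_news variable, and each item has a title, a URL and a country. I then try to iterate over each news title and remove stop words and punctuation. I wrote this code: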

for new in translated_news:
    tk = tokenize(new['title'])
    # delete punctuation signs & stop-words
    for t in tk:
        if (t in punkts) or (t+'\n' in stops):
            tk.remove(t)
    tokens.append(tk)

tokenize is a function that returns a list of tokens. Here is an example of its output:

['medium', ':', 'russian', 'athlete', 'will', 'be', 'admit', 'to', 'the', '2018', 'olympics', 'in', 'neutral', 'status']

And here is the same output with stop words and punctuation removed:

['medium', 'russian', 'athlete', 'be', 'admit', 'the', 'olympics', 'neutral', 'status']

The problem is: even though 'the' and 'be' are in my stop word list, they were not removed from the news title. On other titles, however, it sometimes works correctly:

['wada', 'acknowledge', 'the', 'reliable', 'information', 'provide', 'to', 'rodchenkov']
['wada', 'acknowledge', 'reliable', 'information', 'provide', 'rodchenkov']

Here 'the' was removed from the title. I don't understand what is wrong with the code, and why the output is sometimes perfect and sometimes not.

Answer 1

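You have to iterate over tokenize(new['title']) rather than over the list you are modifying, and you can use De Morgan's laws to simplify the if statement: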

import string

stops = ['will', 'be', 'to', 'the', 'in']

# sample value standing in for tokenize(new['title'])
title_tokens = ['medium', ':', 'russian', 'athlete', 'will', 'be', 'admit',
                'to', 'the', '2018', 'olympics', 'in', 'neutral', 'status']

# delete punctuation signs & stop-words by building a new list
tk = []
for t in title_tokens:  # in the real code: for t in tokenize(new['title'])
    # if not ((t in string.punctuation) or (t in stops)):
    if (t not in string.punctuation) and (t not in stops):  # De Morgan's laws
        tk.append(t)
print(tk)

This prints:

['medium', 'russian', 'athlete', 'admit', '2018', 'olympics', 'neutral', 'status']
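
As a quick sanity check of the De Morgan rewrite, the sketch below runs through all four truth-value combinations and confirms that both forms of the condition agree. The names is_punct and is_stop are illustrative stand-ins for (t in string.punctuation) and (t in stops); they are not part of the original code.

```python
from itertools import product

# is_punct / is_stop stand in for (t in string.punctuation) / (t in stops)
results = []
for is_punct, is_stop in product([False, True], repeat=2):
    negated_or = not (is_punct or is_stop)
    and_of_negations = (not is_punct) and (not is_stop)
    results.append(negated_or == and_of_negations)

print(all(results))  # True
```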

You can strip the newlines from the stop words:

stops = ['will\n', 'be\n', 'to\n', 'the\n', 'in\n']
stops = [item.strip() for item in stops]
print(stops)

This prints:

['will', 'be', 'to', 'the', 'in']
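
The trailing '\n' suggests the stop words were read from a file one line at a time. If so, a minimal sketch of loading them already stripped might look like the following. The io.StringIO object is a hypothetical stand-in for the real file, and storing the words in a set is a suggestion of mine (membership tests on a set are fast), not something the question requires.

```python
import io

# Hypothetical stand-in for an opened stop word file, one word per line.
stop_file = io.StringIO("will\nbe\nto\nthe\nin\n")

# strip() removes the trailing newline; blank lines are skipped.
stops = {line.strip() for line in stop_file if line.strip()}

print(sorted(stops))  # ['be', 'in', 'the', 'to', 'will']
```

With the words stored this way, the check becomes t in stops instead of t+'\n' in stops.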

The solution suggested by incanus86 does work:

tk = [x for x in tokenize(new['title']) if x not in stops and x not in string.punctuation]

But if you knew list comprehensions, you would not be asking.
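
For readers new to list comprehensions: the one-liner is shorthand for the explicit loop-and-append version, and it never mutates the list being iterated. Here is a rough sketch of how it could slot into the asker's outer loop. The tokenize function below is a simplified stand-in (lowercase and split on whitespace), and the single-item translated_news list is made-up sample data; the real ones are not shown in the question.

```python
import string

def tokenize(text):
    # Hypothetical stand-in for the asker's tokenize(): lowercase, split on whitespace.
    return text.lower().split()

stops = ['will', 'be', 'to', 'the', 'in']

# Made-up sample data; the real dictionaries also carry a URL and a country.
translated_news = [
    {'title': 'wada acknowledge the reliable information provide to rodchenkov'},
]

tokens = []
for new in translated_news:
    # Build a fresh list per title instead of removing items in place.
    tk = [x for x in tokenize(new['title'])
          if x not in stops and x not in string.punctuation]
    tokens.append(tk)

print(tokens)
# [['wada', 'acknowledge', 'reliable', 'information', 'provide', 'rodchenkov']]
```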

"I don't understand what is wrong with the code, and why the output is sometimes perfect and sometimes not."

While iterating over the items of tk you skip 'be' and 'the', because you are removing items from tk during the iteration, as this code shows:

import string

stops = ['will', 'be', 'to', 'the', 'in']

tk = [
    'medium',  # 0
    ':',  # 1
    'russian',  # 2
    'athlete',  # 3
    'will',  # 4
    'be',  # 5
    'admit',  # 6
    'to',  # 7
    'the',  # 8
    '2018',  # 9
    'olympics',  # 10
    'in',  # 11
    'neutral',  # 12
    'status'  # 13
]

# delete punctuation signs & stop-words
for t in tk:
    print(len(tk), t, tk.index(t))
    if (t in string.punctuation) or (t in stops):
        tk.remove(t)

print(tk)

This prints:

14 medium 0
14 : 1
13 athlete 2
13 will 3
12 admit 4
12 to 5
11 2018 6
11 olympics 7
11 in 8
10 status 9
['medium', 'russian', 'athlete', 'be', 'admit', 'the', '2018', 'olympics', 'neutral', 'status']

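The loop never visits 'russian', 'be', 'the' or 'neutral'.
The index of 'athlete' is 2 and the index of 'will' is 3, because you removed ':' from tk. The index of 'admit' is 4 and the index of 'to' is 5, because you removed 'will' from tk. The index of '2018' is 6, the index of 'olympics' is 7, the index of 'in' is 8 and the index of 'status' is 9. Each removal shifts the remaining items one position to the left, so the item that slides into the just-visited position is never examined. That is why the result depends on the title: a stop word is missed only when it comes directly after a removed token.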
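
The skipping can be made explicit by mimicking what the for loop does behind the scenes: it keeps a position counter that advances by one after every step, whether or not an item was removed. The sketch below is a hand-written simulation of that behavior (the variable i plays the role of the hidden counter), not the interpreter's actual implementation.

```python
import string

stops = ['will', 'be', 'to', 'the', 'in']
tk = ['medium', ':', 'russian', 'athlete', 'will', 'be', 'admit', 'to',
      'the', '2018', 'olympics', 'in', 'neutral', 'status']

visited = []
i = 0  # plays the role of the loop's hidden position counter
while i < len(tk):
    t = tk[i]
    visited.append(t)
    if t in string.punctuation or t in stops:
        tk.remove(t)  # everything after t shifts one slot to the left
    i += 1            # the counter still advances, so one token is skipped

# Tokens that survived without ever being examined by the loop.
skipped = [t for t in tk if t not in visited]
print(skipped)  # ['russian', 'be', 'the', 'neutral']
```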

Never modify a list while you are iterating over it!
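
If removing in place is really wanted, one common workaround is to iterate over a copy while removing from the original. A minimal sketch, reusing the sample data from this answer:

```python
import string

stops = ['will', 'be', 'to', 'the', 'in']
tk = ['medium', ':', 'russian', 'athlete', 'will', 'be', 'admit', 'to',
      'the', '2018', 'olympics', 'in', 'neutral', 'status']

# tk[:] is a shallow copy, so removals from tk do not disturb the iteration.
for t in tk[:]:
    if t in string.punctuation or t in stops:
        tk.remove(t)

print(tk)
# ['medium', 'russian', 'athlete', 'admit', '2018', 'olympics', 'neutral', 'status']
```

Note that list.remove searches the list from the start on every call, so for long lists building a new list is usually the simpler and faster choice.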

Answer 2

Try stripping the newline characters from the stop words.

Something like this:

tk = [x for x in tokenize(new['title']) if x not in stops and x not in string.punctuation]
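
One caveat: string.punctuation is a plain string, so x not in string.punctuation is a substring test. A multi-character token such as '...' is therefore not recognized as punctuation. A slightly more defensive variant could look like the sketch below; the helper is_punctuation and the sample token list are my own illustrations, not part of the original answers.

```python
import string

PUNCT = set(string.punctuation)

def is_punctuation(token):
    # True only for non-empty tokens made up entirely of punctuation characters.
    return bool(token) and all(ch in PUNCT for ch in token)

stops = {'will', 'be', 'to', 'the', 'in'}
sample = ['wada', '...', 'acknowledge', 'the', 'report', '!']

tk = [x for x in sample if x not in stops and not is_punctuation(x)]
print(tk)  # ['wada', 'acknowledge', 'report']
```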
