Question

I want to remove stop words from a list of lists while keeping the same format (i.e. a list of lists).

Here is the code I have tried so far:

sent1 = 'I have a sentence which is a list'
sent2 = 'I have a sentence which is another list'

from nltk.corpus import stopwords
stop_words = stopwords.words('english')

lst = [sent1, sent2]
sent_lower = [t.lower() for t in lst]

filtered_words=[]
for i in sent_lower:
    i_split = i.split()
    lst = []
    for j in i_split:
        if j not in stop_words:
            lst.append(j)
            " ".join(lst)
            filtered_words.append(lst)

Current output of filtered_words:

filtered_words
[['sentence', 'list'],
 ['sentence', 'list'],
 ['sentence', 'another', 'list'],
 ['sentence', 'another', 'list'],
 ['sentence', 'another', 'list']]

Desired output of filtered_words:

filtered_words
[['sentence', 'list'],
 ['sentence', 'another', 'list']]

I am getting duplicated lists. What might I be doing wrong in the loop? Also, is there a better way to do this than writing so many for loops?

Answer 1

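What you are doing wrong is appending lst to filtered_words every time you find a non-stop word. As a result, you get 2 copies of the filtered sent1 (which contains 2 non-stop words) and 3 copies of the filtered sent2 (which contains 3 non-stop words). Just append once, after you have finished checking each sentence: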

for i in sent_lower:
    i_split = i.split()
    lst = []
    for j in i_split:
        if j not in stop_words:
            lst.append(j)
    filtered_words.append(lst)

By the way,

" ".join(lst)

is useless, because you are computing something (a string) but not storing it anywhere.

Edit

A more Pythonic way, using a list comprehension:
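
To illustrate the point about " ".join(lst): str.join returns a new string and leaves the list untouched, so the result only matters if it is assigned. A minimal sketch; the name joined is purely illustrative and not part of the original code:

```python
lst = ['sentence', 'list']

" ".join(lst)           # the string is built and immediately discarded
joined = " ".join(lst)  # the string is kept because it is assigned

print(lst)     # the list itself is unchanged
print(joined)  # the joined string
```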

for s in sent_lower:
    lst = [j for j in s.split() if j not in stop_words]
    filtered_words.append(lst)
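
The outer loop can be folded in as well, giving a nested list comprehension. The sketch below is self-contained, so it uses a small hand-written stop-word set (an assumption for illustration) in place of stopwords.words('english'); with NLTK available, set(stopwords.words('english')) should work the same way. Using a set rather than a list makes each membership test constant time on average.

```python
# Hypothetical stand-in for set(stopwords.words('english'))
stop_words = {'i', 'have', 'a', 'which', 'is'}

sentences = ['I have a sentence which is a list',
             'I have a sentence which is another list']

# Outer comprehension: one inner list per sentence.
# Inner comprehension: keep only the words that are not stop words.
filtered_words = [
    [word for word in sentence.lower().split() if word not in stop_words]
    for sentence in sentences
]

print(filtered_words)
# [['sentence', 'list'], ['sentence', 'another', 'list']]
```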

Answer 2

Once you have the duplicated results in filtered_words, you can use itertools -

import itertools
filtered_words.sort()  # groupby only merges adjacent equal items, so sort first
list(words for words, _ in itertools.groupby(filtered_words))

The output is -

[['sentence', 'another', 'list'], ['sentence', 'list']]

I followed a link on StackOverflow - Remove duplicates from a list of list
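
Note that sort() reorders the sentences, and any deduplication also collapses two sentences that legitimately filter down to the same words. If deduplication is still what you want, here is a sketch that keeps the original order, assuming the inner lists hold only hashable items such as strings (the names seen and unique are illustrative):

```python
filtered_words = [['sentence', 'list'],
                  ['sentence', 'list'],
                  ['sentence', 'another', 'list'],
                  ['sentence', 'another', 'list'],
                  ['sentence', 'another', 'list']]

seen = set()   # tuples of words already encountered
unique = []    # first occurrence of each inner list, in original order

for words in filtered_words:
    key = tuple(words)  # lists are not hashable, tuples are
    if key not in seen:
        seen.add(key)
        unique.append(words)

print(unique)
# [['sentence', 'list'], ['sentence', 'another', 'list']]
```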

Answer 3

This will give you the desired result:

create or replace procedure doUpdate(
  user_list in S_USER_OBJ_LIST,
  user_out out SYS_REFCURSOR
) is
begin
  ...
  -- set OUT param
  open user_out for select * from users;
end;
