I want to remove stop words from a list of lists while keeping the same format (i.e., a list of lists).
Here is the code I have tried so far:
sent1 = 'I have a sentence which is a list'
sent2 = 'I have a sentence which is another list'
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
lst = [sent1, sent2]
sent_lower = [t.lower() for t in lst]
filtered_words=[]
for i in sent_lower:
    i_split = i.split()
    lst = []
    for j in i_split:
        if j not in stop_words:
            lst.append(j)
            " ".join(lst)
            filtered_words.append(lst)
Current output of filtered_words:
filtered_words
[['sentence', 'list'],
['sentence', 'list'],
['sentence', 'another', 'list'],
['sentence', 'another', 'list'],
['sentence', 'another', 'list']]
Desired output of filtered_words:
filtered_words
[['sentence', 'list'],
['sentence', 'another', 'list']]
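The lists I get are duplicated. What might I be doing wrong in the loop? Also, is there a better way to do this than writing so many for loops?
Answer 0 (score: 3)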
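What you are doing wrong is appending lst to filtered_words every time you find a non-stop word. As a result, you get 2 copies of the filtered sent1 (which contains 2 non-stop words) and 3 copies of the filtered sent2 (which contains 3 non-stop words). Each copy is a reference to the same list object, which is why every copy shows the complete filtered sentence.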
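To make the duplication concrete, here is a minimal sketch of the same append pattern on a single sentence. It assumes a tiny hand-written stop-word set (`stop_words` below is a stand-in for NLTK's list, not the real one) so that it runs without downloading the corpus.

```python
# Stand-in for stopwords.words('english'); an assumption for illustration only.
stop_words = {"i", "have", "a", "which", "is"}

filtered_words = []
lst = []
for word in "i have a sentence which is a list".split():
    if word not in stop_words:
        lst.append(word)
        # Bug being illustrated: the same list object is appended
        # once per non-stop word instead of once per sentence.
        filtered_words.append(lst)

print(filtered_words)
# Every entry refers to the same object, so all entries show the final contents.
print(all(entry is lst for entry in filtered_words))
```

Two non-stop words lead to two appends, and both entries point at the same list.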
Append only once, after each sentence has been fully checked:
for i in sent_lower:
    i_split = i.split()
    lst = []
    for j in i_split:
        if j not in stop_words:
            lst.append(j)
    filtered_words.append(lst)
By the way, " ".join(lst) does nothing useful here, because it computes something (a string) but does not store it anywhere.
Edit
Another, more Pythonic way, using a list comprehension:
for s in sent_lower:
    lst = [j for j in s.split() if j not in stop_words]
    filtered_words.append(lst)
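Going one step further, the outer loop can also be folded into a nested comprehension. The following is only a sketch: `stop_words` is again a small hypothetical stand-in so the example is self-contained, and `sentences` is an illustrative name not used in the question.

```python
# Stand-in for stopwords.words('english'); an assumption for illustration only.
stop_words = {"i", "have", "a", "which", "is"}

sentences = [
    "I have a sentence which is a list",
    "I have a sentence which is another list",
]

# One inner list per sentence: lowercase, split on whitespace, drop stop words.
filtered_words = [
    [word for word in sentence.lower().split() if word not in stop_words]
    for sentence in sentences
]

print(filtered_words)
```

With the real NLTK list, wrapping it as set(stopwords.words('english')) should make the membership test faster than scanning a list for every word.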
Answer 1 (score: 1)
Once you have filtered_words, you can remove the duplicates with itertools:
import itertools
filtered_words.sort()
list(filtered_words for filtered_words,_ in itertools.groupby(filtered_words))
The output is:
[['sentence', 'another', 'list'], ['sentence', 'list']]
I followed this StackOverflow link: Remove duplicates from a list of list
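Note that sorting changes the order of the sentences. If the original order matters, one possible alternative (not part of the answer above) is to deduplicate through tuples, since lists are not hashable. A minimal sketch, relying on dicts preserving insertion order (Python 3.7+); `duplicated` and `unique` are illustrative names:

```python
duplicated = [
    ["sentence", "list"],
    ["sentence", "list"],
    ["sentence", "another", "list"],
    ["sentence", "another", "list"],
    ["sentence", "another", "list"],
]

# Lists are unhashable, so each one is converted to a tuple to serve as a dict key;
# dict.fromkeys keeps only the first occurrence of each key, in insertion order.
unique = [list(t) for t in dict.fromkeys(tuple(words) for words in duplicated)]

print(unique)
```

Any deduplication step would also collapse two input sentences that happen to filter to the same words, so fixing the loop itself is probably the safer route.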
Answer 2 (score: 0)
This will give you the desired result