Let's say I have a list of keywords:
keywords = ["history terms","history words","history vocab","history words terms","history vocab words","science list","science terms vocab","math terms words vocab"]
And a list of main terms:
`main_terms = ["terms","words","vocab","list"]`
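Update, to state the problem more clearly:
The script I'm writing removes near-duplicates from a long list of keywords. I've managed to remove misspellings and slight variants (e.g. "hitsory terms", "history term").
My problem is that the keyword list contains several main terms. Once I've found a keyword containing one of them (e.g. "history terms"), every keyword that is identical except for a different term or combination of terms (e.g. "history vocab", "history words", "history vocab terms", etc.) should be treated as a duplicate.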
Answer 0 (score: 1)
Loop over keywords and test each entry against main_terms:
keywords = ["history terms",
            "history words",
            "history vocab",
            "history words terms",
            "history vocab words",
            "science list",
            "science terms vocab",
            "math terms words vocab"]
main_terms = {"terms", "words", "vocab", "list"}
result = {}
for words in keywords:
    s = set(words.split())
    # Words that are not main terms; one of them becomes the subject key
    s_subject = s - main_terms
    subject = s_subject and next(iter(s_subject))
    # Keep the first keyword that has a main term and an unseen subject
    if s & main_terms and subject and subject not in result:
        result[subject] = words
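Convert the result values to a list (dicts keep insertion order on Python 3.7+; older versions may return the items in a different order):
>>> list(result.values())
['history terms', 'science list', 'math terms words vocab']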
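One thing to watch for: `next(iter(s_subject))` keeps only one of the non-main-term words, so a keyword with a multi-word subject ends up keyed on an arbitrary word. Below is a minimal sketch of a variant that keys on all remaining words in their original order. The function name `dedupe_by_subject` and the multi-word case are my own assumptions for illustration, not something from the question.

```python
def dedupe_by_subject(keywords, main_terms):
    """Keep the first keyword seen for each subject (the non-main-term words)."""
    main_terms = set(main_terms)
    result = {}
    for keyword in keywords:
        words = keyword.split()
        # The subject is every word that is not a main term, in original order.
        subject = tuple(w for w in words if w not in main_terms)
        has_main_term = any(w in main_terms for w in words)
        if subject and has_main_term and subject not in result:
            result[subject] = keyword
    # Dicts preserve insertion order on Python 3.7+, so first-seen order is kept.
    return list(result.values())


keywords = ["history terms", "history words", "history vocab",
            "history words terms", "history vocab words", "science list",
            "science terms vocab", "math terms words vocab"]
main_terms = ["terms", "words", "vocab", "list"]

unique_keywords = dedupe_by_subject(keywords, main_terms)
print(unique_keywords)
```

On the sample data this should keep one keyword each for history, science and math, the same three as above.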
Answer 1 (score: 0)
I'm sure there is a more elegant solution, but this seems to be what you're looking for, at least for part 1:
>>> def remove_main_terms(keyword):
...     words = keyword.split()
...     count = 0
...     to_keep = []
...     for word in words:
...         # Count main terms; everything from the second one on is dropped
...         if word in main_terms:
...             count += 1
...         if count < 2:
...             to_keep.append(word)
...     return " ".join(to_keep)
...
>>> keywords = ["history terms","history words","history vocab","history words terms","history vocab words","science list","science terms vocab","math terms words vocab"]
>>> main_terms = ["terms","words","vocab","list"]
>>> new_list = []
>>> for w in keywords:
...     new_list.append(remove_main_terms(w))
...
>>> new_list
['history terms', 'history words', 'history vocab', 'history words', 'history vocab', 'science list', 'science terms', 'math terms']
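The list above still contains exact repeats such as 'history words'. As a possible follow-up step (my own assumption about what comes next, not part of the answer), here is a small sketch that drops exact repeats while keeping first-seen order, assuming Python 3.7+ where dicts preserve insertion order:

```python
# Output of remove_main_terms applied to the sample keywords
new_list = ['history terms', 'history words', 'history vocab', 'history words',
            'history vocab', 'science list', 'science terms', 'math terms']

# dict.fromkeys keeps the first occurrence of each key, and dicts preserve
# insertion order on Python 3.7+, so exact repeats are dropped in order.
deduped = list(dict.fromkeys(new_list))
print(deduped)
```

This only removes exact repeats. 'history terms' and 'history words' both survive, so it does not by itself collapse keywords that differ only in their main term.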
Answer 2 (score: 0)
If that's the case, the following would work better:
result = []
for word in keywords:
    for term in main_terms:
        # Strip every main term found in the keyword
        if term in word:
            word = word.replace(term, "")
    result.append(word.strip())
print(set(result))
This prints {'science', 'math', 'history'} (the order of a set's elements may vary).
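Be aware that `term in word` and `str.replace` work on substrings, so a main term like "list" would also be cut out of a longer word such as "listing". A rough sketch of a whole-word variant follows. The helper name `strip_main_terms` and the "listing" example are hypothetical and only there to show the difference.

```python
def strip_main_terms(keyword, main_terms):
    """Return the keyword with whole-word main terms removed."""
    return " ".join(w for w in keyword.split() if w not in main_terms)


keywords = ["history terms", "history words", "history vocab",
            "history words terms", "history vocab words", "science list",
            "science terms vocab", "math terms words vocab"]
main_terms = {"terms", "words", "vocab", "list"}

# Collect the distinct subjects, skipping keywords made only of main terms.
subjects = {strip_main_terms(k, main_terms) for k in keywords} - {""}
print(sorted(subjects))

# Whole-word matching leaves "listing" intact, unlike substring replacement.
print(strip_main_terms("listing terms", main_terms))
```

Splitting first and comparing whole words also avoids the leftover double spaces that substring replacement can leave in the middle of a keyword.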
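The following gives the same result for your original problem, but it ignores everything after the first word and keeps only the first keyword for each unique first word: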
result = []
for word in keywords:
    found = False
    for res in result:
        # Skip this keyword if its first word already appears in a kept keyword
        if word.split()[0] in res:
            found = True
    if not found:
        result.append(word)
print(result)
See the demo on repl.it.