How to check whether multiple items from a list appear in a string?

Time: 2016-09-16 22:47:38

Tags: python

Let's say I have a list of keywords:

keywords = ["history terms","history words","history vocab","history words terms","history vocab words","science list","science terms vocab","math terms words vocab"]

And a list of main terms:

`main_terms = ["terms","words","vocab","list"]`

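Update to state the problem more clearly:

The script I am writing removes near-duplicates from a long list of keywords. I have already managed to remove misspellings and slight variations (e.g. "hitsory terms", "history term").

My problem is that this keyword list contains several main terms. Once I have found a keyword with one of those terms (e.g. "history terms"), every keyword that is identical except for a different term or combination of terms (e.g. "history vocab", "history words", "history vocab terms", etc.) should be treated as a duplicate.

  • A keyword may contain several terms (e.g. "math terms words vocab") as long as there is no otherwise identical keyword with fewer terms (e.g. "math terms words" or, ideally, a single term such as "math vocab").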
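
Read literally, the rule above amounts to grouping keywords by their non-term words (the "subject") and keeping, for each group, the keyword with the fewest main terms. Below is a minimal sketch under that reading. The name `dedupe_by_subject` is hypothetical, and splitting on whitespace and keeping the first keyword on ties are assumptions rather than part of the question.

```python
def dedupe_by_subject(keywords, main_terms):
    """Keep, for each subject, the keyword that contains the fewest main terms."""
    terms = set(main_terms)
    best = {}  # subject (tuple of non-term words) -> (number of terms, keyword)
    for keyword in keywords:
        words = keyword.split()
        # Keywords made only of main terms all share the empty subject ().
        subject = tuple(w for w in words if w not in terms)
        n_terms = len(words) - len(subject)
        # Strict "<" keeps the earliest keyword when term counts are equal.
        if subject not in best or n_terms < best[subject][0]:
            best[subject] = (n_terms, keyword)
    return [keyword for _, keyword in best.values()]


keywords = ["history terms", "history words", "history vocab",
            "history words terms", "history vocab words", "science list",
            "science terms vocab", "math terms words vocab"]
main_terms = ["terms", "words", "vocab", "list"]

print(dedupe_by_subject(keywords, main_terms))
# ['history terms', 'science list', 'math terms words vocab']
```

If "math vocab" were also in the list, it would replace "math terms words vocab" because it carries fewer terms.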

3 Answers:

Answer 0 (score: 1)

Loop through keywords and check each one against main_terms:
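keywords = ["history terms",
            "history words",
            "history vocab",
            "history words terms",
            "history vocab words",
            "science list",
            "science terms vocab",
            "math terms words vocab"]
main_terms = {"terms", "words", "vocab", "list"}
result = {}  # subject -> first keyword seen for that subject
for words in keywords:
    s = set(words.split())
    s_subject = s - main_terms  # the words that are not main terms
    # Assumes a single-word subject; None if the keyword has no subject.
    subject = next(iter(s_subject), None)
    # Intersection (&), not union (|): the keyword must contain a main term.
    if s & main_terms and subject and subject not in result:
        result[subject] = words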

Convert the result values to a list (on Python 3.7+ a dict preserves insertion order):

>>> list(result.values())
['history terms', 'science list', 'math terms words vocab']
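
The check in the loop relies on set operations between a keyword's words and the main terms. To answer the title question on its own (which main terms, and how many, appear in one string), a small helper could look like the sketch below. `matched_terms` is a hypothetical name, and matching whole whitespace-separated words is an assumption.

```python
def matched_terms(keyword, main_terms):
    """Return the main terms that occur as whole words in keyword, in keyword order."""
    terms = set(main_terms)
    return [word for word in keyword.split() if word in terms]


main_terms = ["terms", "words", "vocab", "list"]

print(matched_terms("math terms words vocab", main_terms))
# ['terms', 'words', 'vocab']

# True only when more than one main term appears in the keyword.
print(len(matched_terms("history terms", main_terms)) > 1)
# False
```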

Answer 1 (score: 0)

I'm sure there is a more elegant solution, but this seems to be what you are looking for, at least for part 1:

>>> def remove_main_terms(keyword):
...     words = keyword.split()
...     count = 0
...     to_keep = []
...     for word in words:
...         if word in main_terms:
...             count += 1
...         # Keep words until a second main term is reached.
...         if count < 2:
...             to_keep.append(word)
...     return " ".join(to_keep)
...

>>> keywords = ["history terms","history words","history vocab","history words terms","history vocab words","science list","science terms vocab","math terms words vocab"]

>>> main_terms = ["terms","words","vocab","list"]

>>> new_list = []
>>> for w in keywords:
...     new_list.append(remove_main_terms(w))
...

>>> new_list
['history terms', 'history words', 'history vocab', 'history words', 'history vocab', 'science list', 'science terms', 'math terms']
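
The list above still contains exact repeats ('history words' and 'history vocab' each appear twice). One possible follow-up step, sketched here assuming Python 3.7+ where dicts keep insertion order, drops them while preserving order. It only removes exact duplicates; keywords that differ in their remaining term are still kept separately.

```python
new_list = ['history terms', 'history words', 'history vocab', 'history words',
            'history vocab', 'science list', 'science terms', 'math terms']

# dict.fromkeys keeps the first occurrence of each key, in insertion order.
unique = list(dict.fromkeys(new_list))

print(unique)
# ['history terms', 'history words', 'history vocab', 'science list', 'science terms', 'math terms']
```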

Answer 2 (score: 0)

Edit: I increasingly think you are asking an XY question and that what you really want is the unique subjects.

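If that is the case, the following works better:

result = []
for word in keywords:
    # Strip every main term from the keyword, leaving only the subject.
    for term in main_terms:
        if term in word:
            word = word.replace(term, "")
    result.append(word.strip())

print(set(result))

which outputs {'science', 'math', 'history'} (set ordering may vary).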
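
One caveat with substring replacement is that it also cuts terms out of longer words: with "list" as a term, "listening skills list" would become "ening skills". A token-based variant avoids that. This is a sketch, not the answer's original code; `unique_subjects` is a hypothetical name, and whitespace tokenization is an assumption.

```python
def unique_subjects(keywords, main_terms):
    """Return each distinct subject (keyword minus main terms) in first-seen order."""
    terms = set(main_terms)
    seen = {}  # used as an ordered set
    for keyword in keywords:
        # Drop whole words that are main terms; keep everything else.
        subject = " ".join(w for w in keyword.split() if w not in terms)
        if subject:
            seen.setdefault(subject, None)
    return list(seen)


main_terms = ["terms", "words", "vocab", "list"]
sample = ["history terms", "listening skills list", "history vocab words"]

print(unique_subjects(sample, main_terms))
# ['history', 'listening skills']
```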

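The following solves your original problem with the same result, but it ignores everything after the first word and only lets through keywords whose first word has not been seen yet.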

result = []
for word in keywords:
    found = False
    for res in result:
        # Compare first words exactly rather than by substring.
        if word.split()[0] == res.split()[0]:
            found = True
    if not found:
        result.append(word)
print(result)

See the demo on repl.it.