How to check whether any tuple from a list appears in a list of strings

Time: 2019-06-05 21:24:55

Tags: python pandas numpy

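I am trying to check how many tuples from a list of tuples appear in each string of another list, and I also want all the tuples associated with a particular string.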

Example:

A = [["is", "a"],["is", "going"],["to", "the"]]

B = ["This is a bat", "Adam is going to the school"]

I want a result that shows which tuples exist in which string. Expected output:

["is", "a"] exists in "This is a bat"

["is", "going"] and ["to", "the"] exists in "Adam is going to the school"

I tried the code below, but it only gives all the tuples that exist in list B.

A = [["is", "a"],["is", "going"],["to", "the"]]
B = ["This is a bat", "Adam is going to the school"]
matching = [s for s in B if any(x in s for x in A)]

Edit: tried another approach

A = [["is", "a"],["is", "going"],["to", "the"]]
B = ["This is a bat", "Adam is going to the school"]
for i in range(len(B)):
    flag = False
    keywords = ""
    for a in A:
        if a[0]+" "+a[1] in B[i]:
            if(keywords == ""):
                keywords = a[0]+" "+a[1]
            else:
                keywords = keywords + ", " + a[0]+" "+a[1]
    print(keywords)

This approach works well; can it be optimized further?

4 answers:

Answer 0: (score: 0)

import re

A = [["is", "a"], ["is", "going"], ["to", "the"]]
joined_A = [" ".join(a) for a in A]

B = ["This is...a bat!", "Adam is going to the school"]

for s in B:
    print(s)
    normalised = " ".join(re.findall(r"\b\w+\b", s))

    for a, joined in zip(A, joined_A):
        if joined in normalised:
            print(a)

Output:

This is...a bat!
['is', 'a']
Adam is going to the school
['is', 'going']
['to', 'the']

Answer 1: (score: 0)

You can join each list in A with ' '.join() and then check whether the result occurs in a string from B:

from collections import defaultdict

matches = defaultdict(list)
for match_string in A:
    for string in B:
        if ' '.join(match_string) in string:
            matches[string].append(match_string)
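
For reference, here is a self-contained sketch of the snippet above, using the A and B from the question and printing the grouped result in roughly the wording the question asks for. The helper names `group_matches` and `describe` and the output formatting are assumptions added for illustration; they are not part of the original answer.

```python
from collections import defaultdict

A = [["is", "a"], ["is", "going"], ["to", "the"]]
B = ["This is a bat", "Adam is going to the school"]


def group_matches(pairs, strings):
    """Map each string to the word pairs whose joined form it contains."""
    matches = defaultdict(list)
    for pair in pairs:
        joined = " ".join(pair)
        for string in strings:
            # Plain substring test, as in the snippet above.
            if joined in string:
                matches[string].append(pair)
    return matches


def describe(pairs, string):
    """Format one result line (hypothetical helper, illustration only)."""
    listed = " and ".join(str(pair) for pair in pairs)
    return f'{listed} exists in "{string}"'


matches = group_matches(A, B)
for string in B:
    # Skip strings for which no pair was found.
    if matches[string]:
        print(describe(matches[string], string))
```

Because the check is a plain substring test, it is case-sensitive and does not respect word boundaries.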

Answer 2: (score: 0)

Code:

from re import search

A = [["is", "a"], ["is", "going"], ["to", "the"]]
B = ["This is a bat", "Adam is going to the school"]

p_A = ["\\s".join(i) for i in A]
matches = [(s, [A[i] for i, e in enumerate(p_A) if search(e, s)]) for s in B]
for string, patterns in matches:
    print("%s exists in \"%s\"" % (" and ".join("[%s]" % ", ".join("\"%s\"" % s for s in p) for p in patterns), string))

Output:

["is", "a"] exists in "This is a bat"
["is", "going"] and ["to", "the"] exists in "Adam is going to the school"
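
One caveat with joining words into a pattern or substring: the match is not anchored to word boundaries, so "is a" is also found inside "This apple". Below is a minimal sketch of a stricter variant, assuming whole-word matching is what is wanted. `compile_pair` and `find_pairs` are hypothetical helper names, and `re.escape` is added so that words containing regex metacharacters are matched literally.

```python
import re

A = [["is", "a"], ["is", "going"], ["to", "the"]]
B = ["This is a bat", "Adam is going to the school", "This apple"]


def compile_pair(pair):
    """Build a regex matching the words of `pair` in order as whole words."""
    # re.escape keeps characters such as "." or "+" literal;
    # \s+ allows any run of whitespace between the words;
    # \b anchors the match to word boundaries on both ends.
    body = r"\s+".join(re.escape(word) for word in pair)
    return re.compile(r"\b" + body + r"\b")


def find_pairs(pairs, string):
    """Return the pairs from `pairs` that occur in `string`."""
    return [pair for pair in pairs if compile_pair(pair).search(string)]


for string in B:
    print(string, "->", find_pairs(A, string))
```

With these inputs, "This apple" yields an empty list, whereas a plain `"is a" in "This apple"` test would report a match.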

Answer 3: (score: 0)

from sklearn.feature_extraction.text import CountVectorizer
from io import StringIO

A = [["is", "a"], ["is", "going"], ["to", "the"]]
joined_A = [" ".join(a) for a in A]

B = ["This is a bat", "Adam is going to the school"]

def tokenizer(text):
    return text.split()

vec = CountVectorizer(ngram_range = (1,2), tokenizer = tokenizer)

for string in B:
    vec.fit_transform(StringIO(string))
    wordlist = []
    for word, prev in zip(joined_A, A):
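        # get_feature_names() was removed in scikit-learn 1.2;
        # on newer versions call vec.get_feature_names_out() instead.
        if word in vec.get_feature_names():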
            wordlist.append(prev)
    print(f"{wordlist} exists in {string}")


#Output
[['is', 'a']] exists in This is a bat
[['is', 'going'], ['to', 'the']] exists in Adam is going to the school

I am using sklearn's CountVectorizer to build a list of all the words in the string, and then checking whether our words are in it.

I use @Alex Hall's joining of the words in A, since you want each entry treated as a pair.

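Normally we would not have to define a tokenizer function, but sklearn's CountVectorizer automatically drops some words (such as "a") even when stop words are turned off, because its default token pattern only keeps tokens of two or more characters. That is why we need to define one here.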
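
A short sketch of why "a" disappears, assuming CountVectorizer's default `token_pattern` is `(?u)\b\w\w+\b` as stated in the scikit-learn documentation. It applies that pattern with the standard `re` module rather than scikit-learn itself, so it only approximates the tokenization step; the function names are hypothetical.

```python
import re

# Assumed to be CountVectorizer's default token_pattern (per the scikit-learn docs).
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def default_like_tokens(text):
    """Approximate the default tokenization: lowercase, keep tokens of 2+ chars."""
    return re.findall(DEFAULT_TOKEN_PATTERN, text.lower())


def whitespace_tokens(text):
    """Mirror the custom tokenizer from the answer: split on whitespace.

    Lowercasing is included because the vectorizer lowercases by default.
    """
    return text.lower().split()


text = "This is a bat"
print(default_like_tokens(text))  # the single-character token "a" is dropped
print(whitespace_tokens(text))    # "a" is kept
```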

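The ngram_range in our CountVectorizer controls how many consecutive words are treated as a single feature. If we left it at 1, each feature would be a single word; (1, 2) allows up to 2 words. You need to raise the 2 so that it is as large as the longest list in A. If all the entries in A are 2 words long, set it to (2, 2) for speed.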

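Finally, we just loop over the strings. For each string in B we use get_feature_names to extract all its distinct n-grams, check whether the entries of A are among them, and then print the answer.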
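
The same idea can be sketched without scikit-learn: build the set of word n-grams of each string and test whether each entry of A is in it. This is a simplified stand-in for what the vectorizer's vocabulary provides, not its actual implementation. `word_ngrams` and `pairs_in` are hypothetical names, and the lowercasing and whitespace splitting are assumptions chosen to mirror the settings used above.

```python
A = [["is", "a"], ["is", "going"], ["to", "the"]]
B = ["This is a bat", "Adam is going to the school"]


def word_ngrams(text, n_min, n_max):
    """Return the set of word n-grams of `text` for n in [n_min, n_max]."""
    words = text.lower().split()
    grams = set()
    for n in range(n_min, n_max + 1):
        # Slide a window of n consecutive words across the text.
        for start in range(len(words) - n + 1):
            grams.add(tuple(words[start:start + n]))
    return grams


def pairs_in(pairs, text):
    """Return the entries of `pairs` that occur as consecutive words in `text`."""
    # The upper bound plays the role of the second value in ngram_range.
    longest = max(len(pair) for pair in pairs)
    grams = word_ngrams(text, 1, longest)
    return [pair for pair in pairs if tuple(pair) in grams]


for string in B:
    print(f"{pairs_in(A, string)} exists in {string}")
```

Because membership is tested on whole-word tuples, a string such as "This apple" does not match ["is", "a"], unlike a plain substring check.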