Question

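I have a list of about 300,000 keywords (single- and multi-word) and another list of single words. I also have ~1200 files containing multiple lines of text. I need to check whether, in any of the files, entries from the two lists occur close to each other. By close, I mean that the two are about 10 words or fewer apart. An example of a string from the first list is NOD.BDC2.5 transgenic mice, and a string from the other list is inhibition.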

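Any ideas? I have searched extensively but could not find anything. Also, since these are multi-word strings, I cannot use abs(array.index) on the two strings (which might work for single words).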

Thanks

Answer 1

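You can simplify the problem by breaking it into steps. First, loop through the files containing the sentences and check whether a line contains any word from the second list, file2, since it has fewer entries. If it does, check whether a keyword from the first list, file1, is also present in that line.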

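Now use the re.split function to split the line into its constituent words. Find the index of the first word of each of the two entries, then subtract one from the other to see whether they are 10 words or fewer apart. This is straightforward because your second list contains only single words.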
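As a rough illustration of that split-and-subtract step, here is a tiny worked example. The sentence is made up for the illustration, and the variable names are not from the question:

```python
import re

# Hypothetical line of text containing both example strings from the question.
line = "Treatment led to inhibition of diabetes in NOD.BDC2.5 transgenic mice, as expected"

# Split on semicolons, commas, hyphens and spaces; drop empty pieces.
words = [w for w in re.split(r"[;,\- ]+", line) if w]

# A multi-word keyword is located by the position of its first word.
keyword_first_word = "NOD.BDC2.5 transgenic mice".split()[0]

# Distance in words between the keyword's first word and the single word.
distance = abs(words.index(keyword_first_word) - words.index("inhibition"))
print(distance)  # 4
```

Note that list.index returns only the first occurrence, so a line in which either string appears more than once would need extra handling.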

Here is sample code:

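    import re

    # sentences: lines of one text file; file1: the keywords; file2: the single words
    for s in sentences:
        s = s.rstrip()
        words = re.split(r'[;,\- ]+', s)  # split the line on semicolons, commas, hyphens and spaces
        for f2 in file2:
            f2 = f2.strip()  # f2 is an entry from file2
            if not f2 or f2 not in words:
                continue
            for f1 in file1:
                f1 = f1.strip()  # remove surrounding whitespace characters
                if not f1 or f1 not in s:  # search for f1 in the line
                    continue
                first = re.split(r'[;,\- ]+', f1)[0]  # first word of a multi-word keyword
                if first in words:  # ensure both are in the word list
                    r = abs(words.index(first) - words.index(f2))  # distance in words between f1 and f2
                    if r <= 10:
                        print('match found:', f1, '|', f2, '|', s)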
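With around 300,000 keywords, looping over every keyword for every line may be slow. One possible variation, sketched below, indexes the keywords by their first word so that each line only checks the keywords that could start at each position. The helper names tokenize, build_index and find_close_pairs are hypothetical, matching is exact and case-sensitive, and the gap is measured from the nearest end of the keyword; all of these are assumptions rather than part of the approach above.

```python
import re

def tokenize(text):
    """Split on whitespace, commas, semicolons and hyphens; drop empty tokens."""
    return [t for t in re.split(r"[;,\-\s]+", text) if t]

def build_index(keywords):
    """Map the first token of each keyword to the token lists that start with it."""
    index = {}
    for kw in keywords:
        toks = tokenize(kw)
        if toks:
            index.setdefault(toks[0], []).append((kw.strip(), toks))
    return index

def find_close_pairs(line, index, single_words, max_gap=10):
    """Return (keyword, word, gap) for each pair at most max_gap words apart."""
    tokens = tokenize(line)
    # Positions of every single word that occurs on this line.
    positions = [(i, t) for i, t in enumerate(tokens) if t in single_words]
    if not positions:
        return []  # early exit: none of the single words is present

    matches = []
    for start, tok in enumerate(tokens):
        for kw, kw_tokens in index.get(tok, []):
            end = start + len(kw_tokens) - 1  # index of the keyword's last token
            if tokens[start:end + 1] != kw_tokens:
                continue  # only the first token matched
            for p, word in positions:
                if p < start:
                    gap = start - p
                elif p > end:
                    gap = p - end
                else:
                    continue  # the word is part of the keyword itself
                if gap <= max_gap:
                    matches.append((kw, word, gap))
    return matches

index = build_index(["NOD.BDC2.5 transgenic mice", "wild type"])
line = "Treatment led to inhibition of diabetes in NOD.BDC2.5 transgenic mice, as expected"
print(find_close_pairs(line, index, {"inhibition"}))
```

The index would be built once and reused for every line of every file, and single_words is best passed as a set so that membership tests stay cheap.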

Determining proximity between lists of strings
