Question

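I have two txt files. One of them (txt file 1) is very large: it contains 15,000 sentences, broken down one word per line in a fixed format (sentence index, word, tag). The other one (txt file 2) contains about 500 sentences in the format (sentence index, word). I want to find the sentences from txt file 2 in txt file 1, but I also need to extract the tags.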

Format of txt file 1:

1   Flurazepam  O
2   thus    O
3   appears O
4   to  O
5   be  O
6   an  O
7   effective   O
8   hypnotic    O
9   drug    O
10  with    O

Format of txt file 2:

1   More
2   importantly
3   ,
4   this
5   fusion
6   converted
7   a
8   less
9   effective
10  vaccine

Initially, I just tried something naive:

txtfile1=open("/Users/Desktop/Final.txt").read().split('\n')


with open ('/Users/Desktop/sentenceineed.txt','r') as txtfile2:

   whatineed=[]
   for line in txtfile2:
       for part in txtfile1:
           if line == part: 
               whatineed.append(part)

This finds nothing: whatineed ends up as an empty list. Any suggestions would be appreciated.

Answer 1

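Since the first file is much larger than the second, you want to avoid putting the whole first file into memory at once. Putting the second file into memory is no problem. A dictionary is the ideal data type for it, because you can quickly check whether a word is in the dictionary and quickly retrieve its index.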
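As a small illustration of the two dictionary operations mentioned here (membership test and index retrieval), the sketch below uses a few words from the question's sample of txt file 2. The name index_by_word is only illustrative.

```python
# A few words from the sample of txt file 2, mapped to their index.
index_by_word = {"More": 1, "importantly": 2, "fusion": 5, "effective": 9}

# Membership test: is the word present in the second file?
has_effective = "effective" in index_by_word
has_hypnotic = "hypnotic" in index_by_word

# Retrieval: which index does the word have in the second file?
effective_index = index_by_word["effective"]

print(has_effective, has_hypnotic, effective_index)  # True False 9
```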

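Think of your problem this way: find all the words that are in both the first text file and the second text file. Here is an algorithm for that in pseudocode. You did not specify how the "output" should be produced, so I just call it "storage". You did not say whether the "index" of a word should appear in the output, so I included it. If you do not need it, it is simple to remove.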

Initialize a dictionary to empty
for each line in text_file_2:
    parse the index and the word
    Add the word as the key and the index as the value to the dictionary
Initialize the storage for the final result
for each line in text_file_1:
    parse the index, word, and tag
    if the word exists in the dictionary:
        retrieve the index from the dictionary
        store the word, tag, and both indices

Here is code for that algorithm. I left it "expanded" rather than using comprehensions, to make it easier to understand and debug.

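# Load the small file into a dictionary: word -> index in file 2
dictfile2 = dict()
with open('txtfile2.txt') as txtfile2:
    for line2 in txtfile2:
        if not line2.strip():
            continue  # skip blank lines
        index2, word2 = line2.strip().split()
        dictfile2[word2] = index2
# Read the large file line by line and keep the words also found in file 2
listresult = list()
with open('txtfile1.txt') as txtfile1:
    for line1 in txtfile1:
        if not line1.strip():
            continue  # skip blank lines
        index1, word1, tag1 = line1.strip().split()
        if word1 in dictfile2:
            index2 = dictfile2[word1]
            listresult.append((word1, tag1, int(index1), int(index2)))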

Given your example data, this is the result of print(listresult). You may want the result in a different format.

[('effective', 'O', 7, 9)]
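The code above matches word by word. If whole sentences have to match, one possible sketch is shown below. It rests on an assumption the question does not confirm: that the index restarts at 1 at the first word of every sentence. The function names read_sentences and find_tagged_sentences and the tiny in-memory sample are illustrative only.

```python
def read_sentences(lines):
    """Group lines into sentences, yielding one list of field lists per sentence.

    Assumption (not confirmed by the question): the index restarts at 1
    at the first word of every sentence.
    """
    sentence = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields[0] == "1" and sentence:
            yield sentence
            sentence = []
        sentence.append(fields)
    if sentence:
        yield sentence


def find_tagged_sentences(lines1, lines2):
    """Return the sentences of file 1, as (word, tag) pairs, whose words
    match a sentence of file 2 exactly."""
    # Small file: keep every sentence as a tuple of words in a set.
    wanted = {tuple(f[1] for f in s) for s in read_sentences(lines2)}
    found = []
    # Large file: consumed sentence by sentence.
    for s in read_sentences(lines1):
        if tuple(f[1] for f in s) in wanted:
            found.append([(f[1], f[2]) for f in s])
    return found


# Tiny made-up sample; real code would pass open file objects instead.
file1 = ["1 this O", "2 fusion B", "1 thus O", "2 appears O"]
file2 = ["1 this", "2 fusion"]
result = find_tagged_sentences(file1, file2)
print(result)  # [[('this', 'O'), ('fusion', 'B')]]
```

Because both functions only iterate over their arguments, the large file is never held in memory as a whole; only the sentences of the small file are stored.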

Answer 2

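@Rory Daulton pointed it out correctly. Since your first file may be too large to load into memory completely, you should iterate over it instead.

Here is my solution. You can make whatever changes to the implementation you need.

Program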

words_two = set()  # Words from the second file

# Reading textfile2 line by line and collecting its words
with open('textfile2', 'r') as textfile2:
    for line in textfile2:
        values = line.split()
        if len(values) >= 2:
            words_two.add(values[1])

# Opening the output file and the first file
with open('output', 'w') as outfile, open('textfile1', 'r') as textfile1:
    # Reading first file line by line
    for line in textfile1:
        values = line.split()
        if len(values) < 3:
            continue  # Skipping blank or incomplete lines
        index, word, tag = values[:3]

        # Matching if word exists in the second file
        if word in words_two:
            # If word exists then writing index, word and tag to the output file
            outfile.write("{} {} {}\n".format(index, word, tag))

Text file 1

1 Flurazepam O
2 thus O
3 appears I
4 to O
5 be O
6 an O
7 effective B
8 hypnotic B
9 drug O
10 less O
11 converted I
12 maxis O
13 fusion I
14 grave O
15 public O
16 mob I
17 havoc I
18 boss O
19 less B
20 diggy I

Text file 2

1 More
2 importantly
3 ,
4 this
5 fusion
6 converted
7 a
8 less
9 effective
10 vaccine

Output file

7 effective B
10 less O
11 converted I
13 fusion I
19 less B

Here, less appears twice, each time with the tag it has in the data file. I hope this is what you wanted.
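Since a word such as less can occur several times with different tags, it may be convenient to collect the tags per word. A minimal sketch, assuming the "index word tag" layout of the output file shown above; the name tags_by_word is illustrative only.

```python
from collections import defaultdict

# The output lines shown above, as an in-memory list for illustration.
output_lines = [
    "7 effective B",
    "10 less O",
    "11 converted I",
    "13 fusion I",
    "19 less B",
]

# Map each word to the list of (index, tag) pairs it occurred with.
tags_by_word = defaultdict(list)
for line in output_lines:
    index, word, tag = line.split()
    tags_by_word[word].append((int(index), tag))

print(tags_by_word["less"])  # [(10, 'O'), (19, 'B')]
```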

Answer 3

Assuming the spacing in the text files is consistent:

import re

# Open your files
with open('txt file 1.txt', 'r') as text_file1, open('txt file 2.txt', 'r') as text_file2:
    # Save each line's content in a list like l = [[id, word, tag]]
    text_file_1_list = [re.split(r"\s+", l.strip()) for l in text_file1 if l.strip()]
    # Similarly, save all the words of text file 2 in a list
    text_file_2_list = [re.split(r"\s+", l.strip())[1] for l in text_file2 if l.strip()]
print(text_file_2_list)
# Now just a simple search between these two lists
words_found = [[l[1], l[2]] for l in text_file_1_list if l[1] in text_file_2_list]
print(words_found)
# [['effective', 'O']]

I think this should work.

Answer 4

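You do not find any occurrence of the given sentences because the sentence index is part of the comparison. A line from the second file can only match a line in the first file if both carry the same index. For example: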

#file1
3 make tag
7 split tag

#file2
4 make 
6 split

You compare them with if line == part, but clearly "4 make" is not equal to "3 make tag": the index is 3 instead of 4, and the tag part also makes the comparison fail.
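A minimal sketch of why the whole-line comparison never succeeds, using the two example lines above as in-memory strings (the variable names are illustrative). Note that lines obtained by iterating over a file keep their trailing newline, while the entries produced by read().split('\n') do not, which is one more reason the original comparison fails.

```python
# `part` as produced by read().split('\n') on file 1,
# `line` as produced by iterating over file 2 (trailing newline kept).
part = "3 make tag"
line = "4 make \n"

# Whole-line comparison: the index, the tag and the newline all differ.
whole_line_match = (line == part)

# Comparing only the second whitespace-separated field (the word).
word_match = (line.split()[1] == part.split()[1])

print(whole_line_match)  # False
print(word_match)        # True
```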

So just change the condition in order to retrieve the right sentences.

def selectSentence(string):
  """Return the word of a line (its second whitespace-separated field).
  Assumes the words themselves contain no spaces; returns None for
  blank or malformed lines."""
  elements = string.split()
  return elements[1] if len(elements) > 1 else None

txtfile1 = open("file1.txt").read().split('\n')
with open('file2.txt', 'r') as txtfile2:
   whatineed = []
   for line in txtfile2:
       word = selectSentence(line)
       for part in txtfile1:
           # Compare only the words, ignoring index and tag
           if word is not None and word == selectSentence(part):
               whatineed.append(part)

print(whatineed)

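My approach

As @Rory Daulton points out, the file is large, so loading all of it into memory is a bad idea. A better idea is to iterate over it while keeping only the data you need from the small file (the second file) in memory.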

with open("file2.txt") as txtfile2:
   # set to remove duplicates; blank lines are skipped
   sentences_inf2 = {selectSentence(line) for line in txtfile2 if line.strip()}

whatineed = []
with open('file1.txt', 'r') as txtfile1:
   for line in txtfile1:
         # only the current line of the large file is held in memory
         if line.strip() and selectSentence(line) in sentences_inf2:
            whatineed.append(line.strip())

print(whatineed) #['7 effective O']

Extracting sentences from one text file that appear in another text file
