Python: delete lines of a file that have no match in another file

Asked: 2016-03-21 13:19:34

Tags: python string text match string-matching

I have two files: the first contains the data (1st file), and the second contains the list of lines to keep (2nd file).

I am trying to do the filtering with this Python code:

import os.path

# loading the input files
output    = open('descmat.txt', 'w+')
input     = open('descmat_all.txt', 'r')
lists      = open('training_lines.txt', 'r')
print "Test1"

# reading the input files
list_lines = lists.readlines()
list_input = input.readlines()

print "Test2"
output.write(list_input[0])

for i  in range(len(list_lines)):
    for ii in range(len(list_input)):
        position = list_input[ii].find(list_lines[i][:-1])
        if position > -1:
            output.write(list_input[ii])
        break 

print "Test3"
output.close()

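However, this script does not find any matches. What is the simplest way to keep only those lines of the first file that match a line in the second file?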

3 answers:

Answer 0: (score: 2)

For this kind of problem, Python has the set data type:

# prepare a set of normalised training lines
# stripping new lines avoids possible problems with the last line

OK_lines = set(line.rstrip('\n') for line in open('training_lines.txt'))

# when you leave a with block, all the resources are released
# i.e., no need for file.close()

with open('descmat_all.txt') as infile:
    with open('descmat.txt', 'w') as outfile:
        for line in infile:
            # OK_lines have been stripped, input lines must be stripped as well
            if line.rstrip('\n') in OK_lines:
                outfile.write(line)

A simple test:

boffi@debian:~/Documents/tmp$ cat check.py 
# prepare a set of normalised training lines
# stripping new lines avoids possible problems with the last line

OK_lines = set(line.rstrip('\n') for line in open('training_lines.txt'))

# when you leave a with block, all the resources are released
# i.e., no need for file.close()

with open('descmat_all.txt') as infile:
    with open('descmat.txt', 'w') as outfile:
        for line in infile:
            # OK_lines have been stripped, input lines must be stripped as well
            if line.rstrip('\n') in OK_lines:
                outfile.write(line)

boffi@debian:~/Documents/tmp$ cat training_lines.txt 
ada
bob
boffi@debian:~/Documents/tmp$ cat descmat_all.txt 
bob
doug
ada
doug
eddy
ada
bob
boffi@debian:~/Documents/tmp$ python check.py
boffi@debian:~/Documents/tmp$ cat descmat.txt 
bob
ada
ada
bob
boffi@debian:~/Documents/tmp$ 
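
Note that the script in the question also copies the first line of descmat_all.txt unconditionally (output.write(list_input[0])), which the code above does not do. A minimal sketch of the same set-based filter written as a function that can optionally keep that first line; the function name filter_lines, the keep_header flag and the sample data are made up for illustration:

```python
def filter_lines(input_lines, wanted_lines, keep_header=False):
    """Return the lines of input_lines whose text appears in wanted_lines."""
    # Strip newlines so a final line without '\n' still compares equal
    wanted = set(line.rstrip('\n') for line in wanted_lines)
    kept = []
    for index, line in enumerate(input_lines):
        if keep_header and index == 0:
            # Assumption: the first line is a header that is always kept
            kept.append(line)
            continue
        if line.rstrip('\n') in wanted:
            kept.append(line)
    return kept


# Hypothetical in-memory stand-ins for the contents of the two files
sample_input = ['name\n', 'bob\n', 'doug\n', 'ada\n', 'eddy\n']
sample_wanted = ['ada\n', 'bob']  # last line has no trailing newline
result = filter_lines(sample_input, sample_wanted, keep_header=True)
```

To use it on real files, the two arguments could be open file objects, since iterating over a file yields its lines.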

Answer 1: (score: 1)

If you read both files into lists, you can simply compare the lists. See here for how to do that. The resulting list, out, should contain the strings that could be matched.
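
Comparing the two lists could be sketched as follows, assuming a data line must equal a wanted line exactly once trailing newlines are stripped. The sample data is a placeholder for the contents of the two files, and the name wanted is made up for illustration:

```python
# Hypothetical in-memory stand-ins for the contents of the two files
list_input = ['bob\n', 'doug\n', 'ada\n', 'eddy\n']
list_lines = ['ada\n', 'bob\n']

# Normalise the wanted lines, then keep every data line found among them
wanted = [line.rstrip('\n') for line in list_lines]
out = [line for line in list_input if line.rstrip('\n') in wanted]
```

Membership tests on a list are linear in its length, so for large files converting wanted to a set is likely to be faster.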

Answer 2: (score: 0)

Replacing this part of the code:

for i  in range(len(list_lines)):
    for ii in range(len(list_input)):
        position = list_input[ii].find(list_lines[i][:-1])
        if position > -1:
            output.write(list_input[ii])
        break 

with this:

for i  in range(len(list_lines)):
    for ii in range(len(list_input)):
        if list_input[ii][:26] == list_lines[i][:-1]:
            output.write(list_input[ii])

does exactly what I need.
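
As posted, the original loop had its break at the same indentation as the if, so the inner loop stopped after its first iteration and only list_input[0] was ever compared. The replacement compares every pair of lines, which may become slow for large files. A sketch of the same prefix comparison using a set lookup instead, assuming (as the slice [:26] suggests) that the identifying key is the first 26 characters of each data line; the function name filter_by_prefix, the key_length parameter and the sample data are made up for illustration:

```python
def filter_by_prefix(input_lines, wanted_lines, key_length=26):
    """Keep the input lines whose first key_length characters are wanted."""
    # rstrip('\n') also copes with a final line that lacks a newline,
    # which the [:-1] slice in the snippet above would truncate
    wanted = set(line.rstrip('\n') for line in wanted_lines)
    return [line for line in input_lines if line[:key_length] in wanted]


# Hypothetical data using a 3-character key to keep the example short
sample_input = ['abc 1 2\n', 'xyz 3 4\n', 'abd 5 6\n']
sample_wanted = ['abc\n', 'abd']
result = filter_by_prefix(sample_input, sample_wanted, key_length=3)
```

Unlike the nested loops, this sketch returns the lines in the order of the data file and returns each line at most once, even if the list of wanted lines contains duplicates.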