Question

I have the following code:

import fileinput, os, glob, re

# Find text file to search in. Open.
filename = str(glob.glob('*.txt'))[2:][:-2]
print("found " + filename + ", opening...")
f = open(filename, 'r')

# Create output csv write total found occurrences of search string after name of search string 
with open(filename[:-4] + 'output.csv','w') as output:    
    output.write("------------Group 1----------\n")
    output.write(("String 1,") + str((len(re.findall(r's5 .*w249 w1025 w301 w1026 .*',f.read())))) +"\n")
    output.write(("String 1 reverse,") + str((len(re.findall(r's5 .*w1026 w301 w1025 w249 .*',f.read())))) +"\n")

# close and finish
f.close
output.close

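It successfully finds the first string and writes the total count to the output file, but it writes zero matches for 'String 1 reverse', even though it should find 1000.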

It works if I insert this between the searches for String 1 and String 1 reverse:

f.close
f = open(filename, 'r')

That is, I close the file I am reading and then open it again.

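I don't want to add this after every search line. What is going on? Does it have something to do with caching of the open file, or caching in the regular expression module?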

Thanks

Answer 1

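After file.read() runs, the whole file has been read and the file pointer sits at the end of the file; that is why the second line returns no results.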
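A minimal sketch of that behaviour, using an in-memory io.StringIO as a stand-in for the real text file. The sample text and the shortened pattern are invented for illustration and are not taken from the question's data:

```python
import io
import re

# Stand-in for the opened text file; the contents are made-up sample data.
f = io.StringIO("s5 a w249 w1025\ns5 b w249 w1025\n")

first = f.read()    # reads everything and leaves the position at the end
second = f.read()   # nothing is left to read, so this is an empty string

pattern = r"s5 .*w249 w1025"
count_first = len(re.findall(pattern, first))
count_second = len(re.findall(pattern, second))
print(count_first, count_second)
```

The second count is zero not because the pattern fails, but because it is searching an empty string.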

You need to read the contents first and then run the analysis:

print("found " + filename + ", opening...")
f = open(filename, 'r')
contents = f.read()
f.close()  # -- note f.close() not f.close

results_a = re.findall(r's5 .*w249 w1025 w301 w1026 .*',contents)
results_b = re.findall(r's5 .*w1026 w301 w1025 w249 .*',contents)

with open(filename[:-4] + 'output.csv','w') as output:    
    output.write("------------Group 1----------\n")
    output.write("String 1, {}\n".format(len(results_a)))
    output.write("String 1 reverse, {}\n".format(len(results_b)))

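You don't need output.close (without parentheses it never did anything in the first place), because the with statement closes the file automatically.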
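Both points can be checked with a small sketch. The scratch file path below is hypothetical and exists only for this demonstration:

```python
import os
import tempfile

# Hypothetical scratch file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo_output.csv")

with open(path, "w") as output:
    output.write("String 1,0\n")
    closed_inside = output.closed   # still open inside the block

closed_after = output.closed        # the with block has closed the file

f = open(path, "r")
f.close                             # only looks up the method; nothing is called
still_open = not f.closed
f.close()                           # the parentheses actually call it
closed_now = f.closed
print(closed_inside, closed_after, still_open, closed_now)
```

Writing f.close without parentheses is a valid expression, so Python raises no error, which makes the mistake easy to miss.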

If you want to repeat this for every file that matches your pattern:

import glob
import re
import os

BASE_DIR = '/full/path/to/file/directory'

for file in glob.iglob(os.path.join(BASE_DIR, '*.txt')):
    with open(file) as f:
        contents = f.read()
    # Base name without the .txt extension, used to name the output file
    filename = os.path.splitext(os.path.basename(file))[0]
    results_a = re.findall(r's5 .*w249 w1025 w301 w1026 .*', contents)
    results_b = re.findall(r's5 .*w1026 w301 w1025 w249 .*', contents)
    with open(os.path.join(BASE_DIR, '{}output.csv'.format(filename)), 'w') as output:
        output.write("------------Group 1----------\n")
        output.write("String 1, {}\n".format(len(results_a)))
        output.write("String 1 reverse, {}\n".format(len(results_b)))

Answer 2

I haven't tested your example with sample data, but I suspect the problem comes from:

 output.write(("String 1,") + str((len(re.findall(r's5 .*w249 w1025 w301 w1026 .*',f.read())))) +"\n")
 output.write(("String 1 reverse,") + str((len(re.findall(r's5 .*w1026 w301 w1025 w249 .*',f.read())))) +"\n")

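You are calling f.read() twice. The first call reads the whole file and leaves the cursor at the end of the file. The second f.read() returns an empty string because there is no more data to read.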

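Keep in mind that reading a file moves the read cursor (the position attached to the file descriptor): after reading n bytes, the cursor has advanced by n bytes. With no argument, f.read() reads everything up to the end of the file and leaves the cursor there.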
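The cursor movement can be observed with tell(). This is a small sketch on an in-memory binary stream (io.BytesIO) rather than a real file, with ten invented sample bytes, so that positions are plain byte offsets:

```python
import io

# In-memory binary stream standing in for a file; the bytes are sample data.
f = io.BytesIO(b"0123456789")

start = f.tell()        # position before any read
chunk = f.read(4)       # read 4 bytes
after_chunk = f.tell()  # the cursor has advanced by 4
rest = f.read()         # no argument: read up to the end
at_end = f.tell()       # the cursor is now at the end
again = f.read()        # nothing left, so an empty bytes object
print(start, after_chunk, at_end, again)
```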

You have two solutions:

Store the file contents in a variable (for example: content = f.read()) and run the searches on that variable.
Use the file's seek function:

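To change the file object's position, use f.seek(offset, from_what). The position is computed from adding offset to a reference point; the reference point is selected by the from_what argument. A from_what value of 0 measures from the beginning of the file, 1 uses the current file position, and 2 uses the end of the file as the reference point. from_what can be omitted and defaults to 0, using the beginning of the file as the reference point.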

https://docs.python.org/3/tutorial/inputoutput.html
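For completeness, here is a rough sketch of the seek-based option: rewinding with f.seek(0) before the second read. An io.StringIO with two invented sample lines stands in for the real file:

```python
import io
import re

# Stand-in for the opened text file; both lines are made-up sample data.
f = io.StringIO(
    "s5 x w249 w1025 w301 w1026 y\n"
    "s5 x w1026 w301 w1025 w249 y\n"
)

forward = len(re.findall(r"s5 .*w249 w1025 w301 w1026 .*", f.read()))
f.seek(0)   # rewind to the beginning before reading again
reverse = len(re.findall(r"s5 .*w1026 w301 w1025 w249 .*", f.read()))
print(forward, reverse)
```

This works, but it re-reads the file for every search, which matters once the number of patterns grows.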

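The first solution is the recommended one: you don't need to read the file more than once, and seek is mainly intended for operations on large files.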

Here is a fixed version of your code that follows that advice:

import fileinput, os, glob, re

# Find text file to search in. Open.
filename = str(glob.glob('*.txt'))[2:][:-2]
print("found " + filename + ", opening...")
content = open(filename, 'r').read()

# Create output csv write total found occurrences of search string after name of search string 
with open(filename[:-4] + 'output.csv','w') as output:    
    output.write("------------Group 1----------\n")
    output.write(("String 1,") + str((len(re.findall(r's5 .*w249 w1025 w301 w1026 .*',content)))) +"\n")
    output.write(("String 1 reverse,") + str((len(re.findall(r's5 .*w1026 w301 w1025 w249 .*',content)))) +"\n")

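Optimization: note that you no longer call close() yourself, because you keep no reference to the file object. CPython closes the file once that object is garbage-collected, although a with block is the more reliable way to guarantee it.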
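Since the question mentions not wanting extra boilerplate after every search line, one possible way to scale this up is to keep the labelled patterns in a dict and count them all against the same string. The helper name count_matches and the sample text are hypothetical, not part of the original code:

```python
import re

def count_matches(text, patterns):
    """Return {label: number of matches} for each labelled regex in patterns."""
    return {label: len(re.findall(rx, text)) for label, rx in patterns.items()}

patterns = {
    "String 1": r"s5 .*w249 w1025 w301 w1026 .*",
    "String 1 reverse": r"s5 .*w1026 w301 w1025 w249 .*",
}

# Made-up sample standing in for the contents read from the text file.
sample = (
    "s5 a w249 w1025 w301 w1026 b\n"
    "s5 c w1026 w301 w1025 w249 d\n"
    "s5 e w1026 w301 w1025 w249 f\n"
)

counts = count_matches(sample, patterns)
# One CSV line per pattern, in the order the patterns were defined.
lines = ["{},{}".format(label, n) for label, n in counts.items()]
print("\n".join(lines))
```

Adding another search then means adding one entry to patterns; the file is still read only once.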
