Question

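I have a folder named cstruct that contains 20,000 .rsa files. From each file I need to extract every line that contains a CYS value and write it to a new file. Is there a way in Python to loop over the files in this folder and extract this information?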

RES SER A 102 17.74 15.2 17.22 22.0 0.52 1.4 11.89 24.5 5.85 8.6
RES HIS A 103 17.32 9.5 16.53 11.2 0.78 2.2 12.22 12.6 5.10 5.9
RES CYS A 104 0.00 0.0 0.00 0.0 0.00 0.0 0.00 0.0 0.00 0.0
RES LEU A 105 8.67 4.9 8.67 6.1 0.00 0.0 8.67 6.1 0.00 0.0
RES LEU A 106 5.72 3.2 5.72 4.1 0.00 0.0 5.72 4.0 0.00 0.0

Answer 1

A Python script along the following lines should get you heading in the right direction:

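import glob
import os
import re

# A CYS record runs from "RES CYS" up to the next " RES" or the end of the line.
cys_pattern = re.compile(r"(RES CYS\s+\w+.*?)(?= RES|\Z)")

with open("output.txt", "w") as f_output:
    # os.path.join keeps the path portable across Windows, Linux, and macOS.
    for rsa_file in glob.glob(os.path.join("cstruct", "*.rsa")):
        with open(rsa_file, "r") as f_input:
            f_output.write(rsa_file + "\n")
            for row in f_input:
                # Strip the newline first: \Z matches only at the very end of the string.
                for cys in cys_pattern.findall(row.rstrip("\n")):
                    f_output.write(cys + "\n")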
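
To see what the regular expression picks out, here is a small check against part of the sample from the question. It is only a sketch: extract_cys is a name made up for illustration, and the helper strips the trailing newline before matching because \Z matches only at the very end of the string.

```python
import re

# Pattern from the script above: a "RES CYS" record, up to the next " RES"
# or the end of the string.
CYS_PATTERN = re.compile(r"(RES CYS\s+\w+.*?)(?= RES|\Z)")

def extract_cys(text):
    """Return every CYS record found in one line of text (hypothetical helper)."""
    return CYS_PATTERN.findall(text.rstrip("\n"))

# Three records from the question, joined on one line.
sample = ("RES HIS A 103 17.32 9.5 16.53 11.2 0.78 2.2 12.22 12.6 5.10 5.9 "
          "RES CYS A 104 0.00 0.0 0.00 0.0 0.00 0.0 0.00 0.0 0.00 0.0 "
          "RES LEU A 105 8.67 4.9 8.67 6.1 0.00 0.0 8.67 6.1 0.00 0.0")

for record in extract_cys(sample):
    print(record)
```

The same helper should work whether a file holds one record per line or several records on one line, since the lookahead stops at either the next " RES" or the end of the stripped line.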

Answer 2

When you open a file with the built-in open() function and loop over the file object, Python iterates over the file one line at a time by default:

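import os

dirName = "C:\\Wherever\\Your\\Files\\Are"
for rsafile in os.listdir(dirName):
    if not rsafile.endswith(".rsa"):
        continue  # skip anything in the folder that is not an .rsa file
    filepath = os.path.join(dirName, rsafile)
    with open(filepath, "r") as f:
        for line in f:
            if "CYS" in line:
                # line already ends with a newline, so suppress print's own
                print(line, end="")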

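Depending on how your "lines" are defined, you may need to pull the relevant CYS substring out of each line once you have identified the lines that match.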
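
If the matching lines then need to be broken into fields, a plain whitespace split is probably enough. This is a minimal sketch, assuming every record starts with the literal token RES followed by residue name, chain, residue number, and numeric columns, as in the sample from the question. The function name iter_records and the dictionary keys are made up for illustration.

```python
def iter_records(line):
    """Yield one dict per RES record in a line (hypothetical helper).

    Assumes the layout of the question's sample: RES, residue name,
    chain, residue number, then numeric columns.
    """
    tokens = line.split()
    # Positions where a new record starts.
    starts = [i for i, tok in enumerate(tokens) if tok == "RES"]
    for begin, end in zip(starts, starts[1:] + [len(tokens)]):
        fields = tokens[begin:end]
        if len(fields) < 4:
            continue  # incomplete record, skip it
        yield {
            "residue": fields[1],
            "chain": fields[2],
            "number": fields[3],  # kept as text; not assumed to be a plain integer
            "values": [float(v) for v in fields[4:]],
        }

# Shortened records in the question's format, several on one line.
line = ("RES HIS A 103 17.32 9.5 RES CYS A 104 0.00 0.0 "
        "RES LEU A 105 8.67 4.9")
cys_records = [r for r in iter_records(line) if r["residue"] == "CYS"]
print(cys_records)
```

Filtering on the parsed residue name instead of a substring test also avoids picking up a line that merely contains the letters CYS somewhere else.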

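For fun, I compared the speed of this approach (if "pattern" in line) with a regular-expression approach, re.search(".*CYS.*", line).
For a small file on my laptop, the Python "in" operator was about 91 times faster on average (100 iterations).
Regex re.search run time: 0.093 seconds.
"in" operator run time: 0.001 seconds.
These timings were taken with the timeit module. Both include the file open/close overhead, so the difference between them comes from the matching method alone.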
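
For anyone who wants to repeat this kind of comparison, a rough sketch with timeit follows. It is not the script behind the numbers above: it times the two checks on in-memory lines instead of a file, the test lines are made up in the question's format, and the results will differ from machine to machine.

```python
import re
import timeit

# Made-up test data in the question's format; not a real .rsa file.
lines = [
    "RES SER A 102 17.74 15.2 17.22 22.0\n",
    "RES CYS A 104 0.00 0.0 0.00 0.0\n",
    "RES LEU A 105 8.67 4.9 8.67 6.1\n",
] * 100

def with_in():
    """Select lines with the substring test."""
    return [line for line in lines if "CYS" in line]

def with_regex():
    """Select lines with the regular expression from the comparison."""
    return [line for line in lines if re.search(".*CYS.*", line)]

# Both approaches must select the same lines for the comparison to be fair.
assert with_in() == with_regex()

# Run each function 100 times and report the total time.
t_in = timeit.timeit(with_in, number=100)
t_re = timeit.timeit(with_regex, number=100)
print(f"in operator: {t_in:.3f} s")
print(f"re.search:   {t_re:.3f} s")
```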

Extracting data from .rsa files
