How to extract data from a file with Python without removing duplicates

Time: 2018-10-26 23:31:12

Tags: python-2.7 dictionary

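I currently have a script that works well for extracting data from one file based on the key words in a second file (a white list) and writing the extracted data to a third file: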

import sys
import csv

input_file = csv.DictReader(open(sys.argv[1], "rU"))

white_list_file = csv.DictReader(open(sys.argv[2], "rU"))

output_file = csv.DictWriter(open(sys.argv[3], "w"), input_file.fieldnames)

output_file.writeheader()

white_list = {} #load empty dictionary

for record in white_list_file:
    white_list[record["key_word"]] = None

for record in input_file: #for every item in my input file
    record_id = record["key_word"] #assign column with key word from input file as a variable
    if (record_id in (white_list)): # if this key word is in my white list,
        output_file.writerow(record)   # then I write the whole line in my output file

    else:   # if not, then ignore this line and move on to the next line
        continue

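However, the resulting output file is a de-duplicated version of the original input file. This worked well for me in the past, but now I need a new script that does not remove duplicate results.

So if my input file has a key word on 3 different rows, I want my output file to contain that key word and its associated information 3 times as well.

I tried to solve this by modifying the script with a "counter" approach that counts how many times a key word is found in the white list, but it does not work properly or produce the expected result.

Is there a simple way to modify the script so that duplicates are kept in the output file?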

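1 Answer:

Answer 0: (score: 0)

With the code below you can get the desired output. Suppose the input file is named data.csv; the file may also contain blank lines or extra spaces: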

HEADER    Signaling Protein                       03-May-12   4F0A
TITLE     Crystal Structure Of Xwnt8 In Complex With The Cysteine    
TITLE    2 rich Domain Of Frizzled 8                                  
AUTHOR    C.Y.Janda,D.Waghray,A.M.Levin,C.Thomas,K.C.Garcia          
REMARK = 1 NCBI PDB FORMAT VERSION 6.0
REMARK = 2 NOTE:  NCBI-MMDB PDB-Format File derived from ASN.1
REMARK = 3 Refer to original ASN.1 file or PDB file for data records


HELIX    1   1 GLN A   62  HIS A   70  1                                    9
HELIX    2   2 PHE A   72  GLN A   79  1                                    8
HELIX    3   3 LEU A   84  TYR A   92  1                                    9
HELIX    4   4 SER A  109  TYR A  125  1                                   17 
HELIX    1   1 PRO B   34  ALA B   42  1                                    9
HELIX    2   2 SER B   43  PHE B   59  1                                   17
HELIX    3   3 ARG B   84  SER B  106  1                                   23
HELIX    4   4 ALA B  137  PHE B  147  1                                   11
HELIX    5   5 ALA B  157  GLU B  175  1                                   19
HELIX    6   6 PHE B  202  GLN B  215  1                                   14
HELIX    7   7 GLY B  236  SER B  244  1                                    9
ATOM      1  N   CYS A  35     -46.772 -32.953  13.444  1.00118.86           N  
ATOM      2  CA  CYS A  35     -45.589 -33.712  13.063  1.00132.02           C  
ATOM      3  C   CYS A  35     -45.956 -34.934  12.237  1.00141.34           C  
ATOM      4  O   CYS A  35     -47.000 -35.548  12.450  1.00140.11           O  
SEQRES = 1 A  132  ALA SER ALA LYS GLU LEU ALA CYS GLN GLU ILE THR VAL
SEQRES = 2 A  132  PRO LEU CYS LYS GLY ILE GLY TYR ASN TYR THR TYR MET
SEQRES = 25 B  316  HIS PHE CYS ALA
ATOM      5  CB  CYS A  35     -44.802 -34.155  14.301  1.00137.04           C  
ATOM      6  SG  CYS A  35     -43.999 -32.812  15.204  1.00163.69           S  
ATOM      7  N   GLN A  36     -45.100 -35.263  11.277  1.00149.21           N  
ATOM      8  CA  GLN A  36     -45.159 -36.550  10.594  1.00144.14           C  
ATOM      9  C   GLN A  36     -43.746 -37.119  10.503  1.00143.70           C  
SHEET    1   A 1 CYS A  35  ILE A 38  0
SHEET    2   A 1 ASN A  49  TYR A 52  0
SHEET    1   B 1 GLY B 121  ARG B126  0
SHEET    2   B 1 GLY B 127  GLY B131  0
SHEET    3   B 1 THR B 176  HIS B184  0

From it, you want to extract the lines that start with one of the following keys, which are stored in the file keys.txt:

REMARK
HELIX
SEQRES
SHEET

To do that, you can use the following code:

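#!/usr/bin/python
# Read all lines of the input file, without trailing newlines.
with open('data.csv', 'r') as sourcefile:
    source = sourcefile.read().splitlines()

# Read the keys; a set makes each membership test fast.
with open('keys.txt', 'r') as keyfile:
    keys = set(keyfile.read().split())

with open('MyOutFile', 'w') as outfile:
    for line in source:
        fields = line.split()
        # Skip blank lines; keep every line whose first field is a key.
        if fields and fields[0] in keys:
            outfile.write(line + "\n")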

This extracts every line whose first field is a key from keys.txt, giving:

REMARK = 1 NCBI PDB FORMAT VERSION 6.0
REMARK = 2 NOTE:  NCBI-MMDB PDB-Format File derived from ASN.1
REMARK = 3 Refer to original ASN.1 file or PDB file for data records
HELIX    1   1 GLN A   62  HIS A   70  1                                    9
HELIX    2   2 PHE A   72  GLN A   79  1                                    8
HELIX    3   3 LEU A   84  TYR A   92  1                                    9
HELIX    4   4 SER A  109  TYR A  125  1                                   17 
HELIX    1   1 PRO B   34  ALA B   42  1                                    9
HELIX    2   2 SER B   43  PHE B   59  1                                   17
HELIX    3   3 ARG B   84  SER B  106  1                                   23
HELIX    4   4 ALA B  137  PHE B  147  1                                   11
HELIX    5   5 ALA B  157  GLU B  175  1                                   19
HELIX    6   6 PHE B  202  GLN B  215  1                                   14
HELIX    7   7 GLY B  236  SER B  244  1                                    9
SEQRES = 1 A  132  ALA SER ALA LYS GLU LEU ALA CYS GLN GLU ILE THR VAL
SEQRES = 2 A  132  PRO LEU CYS LYS GLY ILE GLY TYR ASN TYR THR TYR MET
SEQRES = 25 B  316  HIS PHE CYS ALA
SHEET    1   A 1 CYS A  35  ILE A 38  0
SHEET    2   A 1 ASN A  49  TYR A 52  0
SHEET    1   B 1 GLY B 121  ARG B126  0
SHEET    2   B 1 GLY B 127  GLY B131  0
SHEET    3   B 1 THR B 176  HIS B184  0
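
Note that the filter keeps every matching line, so a key that appears several times (such as HELIX above) is written several times. A minimal sketch of the same idea as a reusable function, assuming the keys fit in memory; the name `filter_lines` and the shortened sample lines are hypothetical and only for illustration:

```python
def filter_lines(lines, keys):
    """Return the lines whose first whitespace-separated field is in keys.

    Every matching line is kept, in input order, including repeats.
    """
    wanted = set(keys)  # set membership tests are fast
    kept = []
    for line in lines:
        fields = line.split()
        if fields and fields[0] in wanted:  # blank lines have no fields
            kept.append(line)
    return kept


# Shortened, made-up sample in the style of data.csv.
sample = [
    "HELIX    1   1 GLN A   62  HIS A   70  1",
    "",
    "ATOM      1  N   CYS A  35",
    "HELIX    1   1 PRO B   34  ALA B   42  1",
]
result = filter_lines(sample, ["REMARK", "HELIX"])
print(len(result))
```

Both HELIX lines survive because nothing in the loop remembers which keys were already seen.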

This should solve your problem.
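
The same approach might be applied to the CSV layout from the question. The following is only a sketch, assuming both files have a header row containing a `key_word` column as in the question's script; the function name `filter_rows` and the sample rows are hypothetical. It takes iterables of text lines, so already opened file objects could be passed in instead of the lists used here:

```python
import csv


def filter_rows(input_lines, white_list_lines, key_column="key_word"):
    """Return every input row whose key_column value is in the white list.

    Rows are returned in input order and repeated keys are all kept.
    """
    white_list = set(
        row[key_column] for row in csv.DictReader(white_list_lines)
    )
    return [
        row for row in csv.DictReader(input_lines)
        if row[key_column] in white_list
    ]


# Made-up sample data: "alpha" appears on three different rows.
input_lines = [
    "key_word,value",
    "alpha,1",
    "beta,2",
    "alpha,3",
    "gamma,4",
    "alpha,5",
]
white_list_lines = ["key_word", "alpha", "gamma"]

rows = filter_rows(input_lines, white_list_lines)
print([row["value"] for row in rows])
```

The white list is only used for membership tests, so each matching input row is returned once per occurrence and the three `alpha` rows all appear in the result.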