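I currently have a script that works well for extracting data from one file based on the keywords in a second file (a white list) and writing the extracted data to a third file: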
import sys
import csv

# newline="" lets the csv module handle line endings itself (Python 3).
input_file = csv.DictReader(open(sys.argv[1], newline=""))
white_list_file = csv.DictReader(open(sys.argv[2], newline=""))
output_file = csv.DictWriter(open(sys.argv[3], "w", newline=""), input_file.fieldnames)
output_file.writeheader()

white_list = {}  # start with an empty dictionary
for record in white_list_file:
    white_list[record["key_word"]] = None

for record in input_file:  # for every row in my input file
    record_id = record["key_word"]  # value of the key word column for this row
    if record_id in white_list:  # if this key word is in my white list,
        output_file.writerow(record)  # then I write the whole row to my output file
    else:  # if not, ignore this row and move on to the next one
        continue
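However, the resulting output file is a de-duplicated version of the original input file. That worked well for me in the past, but now I need a new script that does not remove duplicate results.
So, if my input file has a keyword on 3 different lines, I want my output file to contain that keyword and its associated information 3 times as well.
I tried to modify the script with a "counter" approach, counting how many times each keyword is found in the white list, but it did not work properly or produce the expected results.
Is there a simple way to modify the script so that the output file is not de-duplicated?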
Answer 0: (score: 0)
Using the code given here, you can get the desired output as follows. The input file is named data.csv, and it may also contain blank lines:
HEADER Signaling Protein 03-May-12 4F0A
TITLE Crystal Structure Of Xwnt8 In Complex With The Cysteine
TITLE 2 rich Domain Of Frizzled 8
AUTHOR C.Y.Janda,D.Waghray,A.M.Levin,C.Thomas,K.C.Garcia
REMARK = 1 NCBI PDB FORMAT VERSION 6.0
REMARK = 2 NOTE: NCBI-MMDB PDB-Format File derived from ASN.1
REMARK = 3 Refer to original ASN.1 file or PDB file for data records
HELIX 1 1 GLN A 62 HIS A 70 1 9
HELIX 2 2 PHE A 72 GLN A 79 1 8
HELIX 3 3 LEU A 84 TYR A 92 1 9
HELIX 4 4 SER A 109 TYR A 125 1 17
HELIX 1 1 PRO B 34 ALA B 42 1 9
HELIX 2 2 SER B 43 PHE B 59 1 17
HELIX 3 3 ARG B 84 SER B 106 1 23
HELIX 4 4 ALA B 137 PHE B 147 1 11
HELIX 5 5 ALA B 157 GLU B 175 1 19
HELIX 6 6 PHE B 202 GLN B 215 1 14
HELIX 7 7 GLY B 236 SER B 244 1 9
ATOM 1 N CYS A 35 -46.772 -32.953 13.444 1.00118.86 N
ATOM 2 CA CYS A 35 -45.589 -33.712 13.063 1.00132.02 C
ATOM 3 C CYS A 35 -45.956 -34.934 12.237 1.00141.34 C
ATOM 4 O CYS A 35 -47.000 -35.548 12.450 1.00140.11 O
SEQRES = 1 A 132 ALA SER ALA LYS GLU LEU ALA CYS GLN GLU ILE THR VAL
SEQRES = 2 A 132 PRO LEU CYS LYS GLY ILE GLY TYR ASN TYR THR TYR MET
SEQRES = 25 B 316 HIS PHE CYS ALA
ATOM 5 CB CYS A 35 -44.802 -34.155 14.301 1.00137.04 C
ATOM 6 SG CYS A 35 -43.999 -32.812 15.204 1.00163.69 S
ATOM 7 N GLN A 36 -45.100 -35.263 11.277 1.00149.21 N
ATOM 8 CA GLN A 36 -45.159 -36.550 10.594 1.00144.14 C
ATOM 9 C GLN A 36 -43.746 -37.119 10.503 1.00143.70 C
SHEET 1 A 1 CYS A 35 ILE A 38 0
SHEET 2 A 1 ASN A 49 TYR A 52 0
SHEET 1 B 1 GLY B 121 ARG B126 0
SHEET 2 B 1 GLY B 127 GLY B131 0
SHEET 3 B 1 THR B 176 HIS B184 0
From it, you want to extract the lines that start with one of the following keys, which are listed in the file keys.txt:
REMARK
HELIX
SEQRES
SHEET
To do that, you can use the following code:
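#!/usr/bin/python
# Read every line of the input file (trailing newlines stripped).
with open('data.csv', 'r') as sourcefile:
    source = sourcefile.read().splitlines()
# Read the whitespace-separated keys into a set for fast membership tests.
with open('keys.txt', 'r') as keyfile:
    keys = set(keyfile.read().split())
with open('MyOutFile', 'w') as outfile:
    for line in source:
        fields = line.split()
        # Skip blank lines; keep any line whose first field is a key.
        if fields and fields[0] in keys:
            outfile.write(line + "\n")
# No explicit close() is needed: the with block closes the file.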
This writes every line whose first field is one of the keys in keys.txt to MyOutFile:
REMARK = 1 NCBI PDB FORMAT VERSION 6.0
REMARK = 2 NOTE: NCBI-MMDB PDB-Format File derived from ASN.1
REMARK = 3 Refer to original ASN.1 file or PDB file for data records
HELIX 1 1 GLN A 62 HIS A 70 1 9
HELIX 2 2 PHE A 72 GLN A 79 1 8
HELIX 3 3 LEU A 84 TYR A 92 1 9
HELIX 4 4 SER A 109 TYR A 125 1 17
HELIX 1 1 PRO B 34 ALA B 42 1 9
HELIX 2 2 SER B 43 PHE B 59 1 17
HELIX 3 3 ARG B 84 SER B 106 1 23
HELIX 4 4 ALA B 137 PHE B 147 1 11
HELIX 5 5 ALA B 157 GLU B 175 1 19
HELIX 6 6 PHE B 202 GLN B 215 1 14
HELIX 7 7 GLY B 236 SER B 244 1 9
SEQRES = 1 A 132 ALA SER ALA LYS GLU LEU ALA CYS GLN GLU ILE THR VAL
SEQRES = 2 A 132 PRO LEU CYS LYS GLY ILE GLY TYR ASN TYR THR TYR MET
SEQRES = 25 B 316 HIS PHE CYS ALA
SHEET 1 A 1 CYS A 35 ILE A 38 0
SHEET 2 A 1 ASN A 49 TYR A 52 0
SHEET 1 B 1 GLY B 121 ARG B126 0
SHEET 2 B 1 GLY B 127 GLY B131 0
SHEET 3 B 1 THR B 176 HIS B184 0
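This should solve your problem: every matching line is written, so a key that appears on several lines of the input appears the same number of times in the output.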
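For CSV input like the script in the question, the same keep-every-match idea can be sketched with csv.DictReader and a set. This is a minimal sketch, assuming both files have a key_word column as in the question; the function name filter_by_whitelist, its parameter names, and the sample data are hypothetical and only there for illustration.

```python
import csv
import io


def filter_by_whitelist(input_csv, whitelist_csv, output_csv, key_column="key_word"):
    """Copy every row of input_csv whose key_column value is in the white list.

    All three file arguments are open text-mode file objects (hypothetical
    names). Returns the number of rows written.
    """
    # A set gives fast membership tests; repeated keys in the white list are harmless.
    white_list = {row[key_column] for row in csv.DictReader(whitelist_csv)}

    reader = csv.DictReader(input_csv)
    writer = csv.DictWriter(output_csv, fieldnames=reader.fieldnames)
    writer.writeheader()

    written = 0
    for row in reader:
        # There is no "already seen" bookkeeping, so a key that occurs on
        # several rows is written once per row.
        if row[key_column] in white_list:
            writer.writerow(row)
            written += 1
    return written


# Small in-memory demonstration with made-up data (not from the question).
data = io.StringIO("key_word,value\nHELIX,1\nATOM,2\nHELIX,3\nSHEET,4\nHELIX,5\n")
keys = io.StringIO("key_word\nHELIX\nSHEET\n")
out = io.StringIO()
count = filter_by_whitelist(data, keys, out)
print(count)
print(out.getvalue())
```

With real files, the three arguments would be opened with newline="" as the csv module documentation recommends, for example open(sys.argv[1], newline="") for the input and open(sys.argv[3], "w", newline="") for the output.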