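This is the content of my text file. I only want to get the SHA-1 and the description, then parse them into a CSV file using a prefix and a delimiter: trim the string, take the SHA-1 that sits between "\" and "->", and then get the description that follows.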
+----------------------------------------------------+
| VSCAN32 Ver 2.00-1655 |
| |
| Copyright (c) 1990 - 2012 xxx xxx xxx Inc. |
| |
| Maintained by xxxxxxxxx QA for VSAPI Testing |
+----------------------------------------------------+
Setting Process Priority to NORMAL: Success 1
Successfully setting POL Flag to 0
VSGetVirusPatternInformation is invoked
Reading virus pattern from lpt$vpn.527 (2018/09/25) (1452700)
Scanning samples_extracted\88330686ae94a9b97e1d4f5d4cbc010933f90f9a->(MS Office 2007 Word 4045-1)
->Found Virus [TROJ_FRS.VSN11I18]
Scanning samples_extracted\8d286d610f26f368e7a18d82a21dd68b68935d6d->(Microsoft RTF 6008-0)
->Found Virus [Possible_SMCCVE20170199]
Scanning samples_extracted\a10e5f964eea1036d8ec50810f1d87a794e2ae8c->(ASCII text 18-0)
->Found Virus [Trojan.VBS.NYMAIM.AA]
18 files have been checked.
Found 16 files containing viruses.
(malloc count, malloc total, free total) = (0, 35, 35)
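This is my code so far. It still outputs many extra strings, but I only need the SHA-1 and the description parsed into the CSV. I used split to pick out the SHA-1 between "\" and "->". The SHA-1 does come out, but the description is not trimmed and the rest of the content is still there.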
import csv

INPUTFILE = 'input.txt'
OUTPUTFILE = 'output.csv'
PREFIX = '\\'
DELIMITER = '->'


def read_text_file(inputfile):
    data = []
    with open(inputfile, 'r') as f:
        lines = f.readlines()
    for line in lines:
        line = line.rstrip('\n')
        if not line == '':
            # Keep everything after the first backslash, then split on '->'
            line = line.split(PREFIX, 1)[-1]
            parts = line.split(DELIMITER)
            data.append(parts)
    return data


def write_csv_file(data, outputfile):
    # 'wb' is the Python 2 idiom; on Python 3 use open(outputfile, 'w', newline='')
    with open(outputfile, 'wb') as csvfile:
        csvwriter = csv.writer(csvfile, delimiter=',', quotechar='"',
                               quoting=csv.QUOTE_ALL)
        for row in data:
            csvwriter.writerow(row)


def main():
    data = read_text_file(INPUTFILE)
    write_csv_file(data, OUTPUTFILE)


if __name__ == '__main__':
    main()
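This is what I want in the CSV: the SHA-1 and the description. However, my output file contains the whole text file. It only separates out the SHA-1 and puts it in its own column.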
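Edit: At first it worked, but some entries span multiple lines, like the one below, and those do not end up in the CSV file correctly. Does anyone have an answer?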
Scanning samples_extracted\0191a23ee122bdb0c69008971e365ec530bf03f5
- Invoice_No_94497.doc->Found Virus [Trojan.4FEC5F36]->(MIME 6010-0)
- Found 1/3 Viruses in samples_extracted\0191a23ee122bdb0c69008971e365ec530bf03f5
Answer 0 (score: 0)
With minimal changes, you can use this part of the code:
for line in lines:
    line = line.rstrip('\n')
    if not line == '' and DELIMITER in line and "Found" not in line:  # <---
        line = line.split(PREFIX, 1)[-1]
        parts = line.split(DELIMITER)
        data.append(parts)
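To see what this condition lets through, it can be tried on a few lines from the sample log in isolation. This is only a sketch: the helper name `keep` and the `sample` list are made up for the illustration, while `PREFIX` and `DELIMITER` are the constants from the question.

```python
PREFIX = '\\'
DELIMITER = '->'


def keep(line):
    """Same test as the filter above: contains '->' and does not mention 'Found'."""
    return line != '' and DELIMITER in line and 'Found' not in line


# A few lines copied from the sample log in the question
sample = [
    r'Scanning samples_extracted\a10e5f964eea1036d8ec50810f1d87a794e2ae8c->(ASCII text 18-0)',
    '->Found Virus [Trojan.VBS.NYMAIM.AA]',
    '18 files have been checked.',
    'Found 16 files containing viruses.',
]

# Drop the path prefix, then split into [sha1, description]
rows = [line.split(PREFIX, 1)[-1].split(DELIMITER) for line in sample if keep(line)]
print(rows)
```

Note that the multi-line entry from the question's edit carries its description on a line that also contains "Found", so this filter would drop it.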
But I would prefer to use a regular expression:
import re

for line in lines:
    line = line.rstrip('\n')
    if re.search(r'[a-zA-Z0-9]{40}->\(', line):  # <----
        line = line.split(PREFIX, 1)[-1]
        parts = line.split(DELIMITER)
        data.append(parts)
The result will be:
cat output.csv
"88330686ae94a9b97e1d4f5d4cbc010933f90f9a","(MS Office 2007 Word 4045-1)"
"8d286d610f26f368e7a18d82a21dd68b68935d6d","(Microsoft RTF 6008-0)"
"a10e5f964eea1036d8ec50810f1d87a794e2ae8c","(ASCII text 18-0)"
Answer 1 (score: 0)
import re
import pandas as pd

storedvalue = []
# Open for reading: with "a+" the read position can start at the end of the file,
# in which case readlines() returns nothing
with open("inputfile", "r") as a:
    for text in a:
        # Match the full 40-character SHA-1; r'\d.+?' would start at the first
        # digit and cut off a hash that begins with a letter
        matched_words = re.search(r'[0-9a-fA-F]{40}->\(.*?\)', text)
        if matched_words is not None:
            storedvalue.append(tuple(matched_words.group().split("->", 1)))
dataframe = pd.DataFrame(storedvalue, columns=["hashvalue", "description"])
# index=False keeps the DataFrame row index out of the CSV
dataframe.to_csv("output.csv", index=False)
The result will be:
hashvalue description
88330686ae94a9b97e1d4f5d4cbc010933f90f9a (MS Office 2007 Word 4045-1)
8d286d610f26f368e7a18d82a21dd68b68935d6d (Microsoft RTF 6008-0)
a10e5f964eea1036d8ec50810f1d87a794e2ae8c (ASCII text 18-0)
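Neither snippet covers the multi-line entries from the question's edit, where the SHA-1 is on the "Scanning" line and the description is at the end of a following line. One possible approach, sketched below, is to remember the most recent SHA-1 and pair it with the next description that turns up. The names `parse_scan_log`, `SCAN_RE` and `DESC_RE` are made up for this example, and the code assumes the log always looks like the two samples in the question, which may not hold for other VSCAN32 output.

```python
import re

# "Scanning ...<40 hex chars>", optionally followed by "->(description)" on the same line
SCAN_RE = re.compile(r'^Scanning .*?([0-9a-fA-F]{40})(?:->(\(.*\)))?\s*$')
# A parenthesised description at the end of a line, introduced by "->"
DESC_RE = re.compile(r'->(\(.*\))\s*$')


def parse_scan_log(lines):
    """Return (sha1, description) pairs for single-line and multi-line entries."""
    rows = []
    current_sha1 = None  # SHA-1 still waiting for its description
    for line in lines:
        scan = SCAN_RE.match(line)
        if scan:
            current_sha1 = scan.group(1)
            if scan.group(2) is not None:
                # Single-line form: the description is on the same line
                rows.append((current_sha1, scan.group(2)))
                current_sha1 = None
            continue
        desc = DESC_RE.search(line)
        if current_sha1 is not None and desc:
            # Multi-line form: the description follows on a later line
            rows.append((current_sha1, desc.group(1)))
    return rows


sample = [
    r'Scanning samples_extracted\88330686ae94a9b97e1d4f5d4cbc010933f90f9a->(MS Office 2007 Word 4045-1)',
    '->Found Virus [TROJ_FRS.VSN11I18]',
    r'Scanning samples_extracted\0191a23ee122bdb0c69008971e365ec530bf03f5',
    '- Invoice_No_94497.doc->Found Virus [Trojan.4FEC5F36]->(MIME 6010-0)',
    r'- Found 1/3 Viruses in samples_extracted\0191a23ee122bdb0c69008971e365ec530bf03f5',
]
rows = parse_scan_log(sample)
for sha1, description in rows:
    print(sha1, description)
```

The resulting list of tuples can be handed to `csv.writer(...).writerows(rows)` or to `pd.DataFrame(rows, columns=[...])` as in the answers above. If an archive contains several infected files, this sketch emits one row per description under the same SHA-1.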