I'm new to Python and I'm trying to process a CSV file with multiple columns: the first column is the server name and the remaining columns are information about that server.
Sample data:
**Client Name,Job Duration,Job File Count,Throughput (KB/sec),Job Primary ID,Schedule/Level Type,Master Server,Media Server,Policy Name,Job Type,Job Attempt Count,Schedule Name,Protected Data Size(MB),Accelerator Enabled,Job Start Time,Accelerator Data Sent (MB),Accelerator Savings(MB),Accelerator Optimization %,Job End Time,Deduplication Enabled,Post Deduplication Size(MB),Deduplication Savings (MB),Total Optimization % (Accelerator + Deduplication),Job Status,Status Code,Policy Keyword,Storage Unit Name**
ambgsun39,00:12:00,0,0,37525,Full,MYPVLXBAKCLU,ambglx24,C1_F4_AD_SHS_COMPUTRON_DGLP_COLD,Backup,1,Monthly_Full,0,No,"Aug 1, 2015 3:00:00 AM",-,0,0,"Aug 1, 2015 3:12:00 AM",No,0,0,0,Successful,0,-,stu_PDC99002_IP_ambglx24
ambglx21,00:03:02,0,0,37527,Full,MYPVLXBAKCLU,ambglx21,C2_F6_AM_REB_CFS,Backup,1,UNKNOWN,0,No,"Aug 1, 2015 3:00:00 AM",-,0,0,"Aug 1, 2015 3:03:02 AM",No,0,0,0,Successful,0,-,UNKNOWN
ambglx21,00:03:42,0,0,37528,Full,MYPVLXBAKCLU,ambglx21,C2_F6_AM_REB_CFS_DB,Backup,1,UNKNOWN,0,No,"Aug 1, 2015 3:00:00 AM",-,0,0,"Aug 1, 2015 3:03:42 AM",No,0,0,0,Successful,0,-,UNKNOWN
ambgsun39,00:11:02,1,"95,543",37531,User backup,MYPVLXBAKCLU,ambglx24,C1_F4_AD_SHS_COMPUTRON_DGLP_COLD,Backup,1,Default-Application-Backup,"60,834.78",No,"Aug 1, 2015 3:00:24 AM",-,0,0,"Aug 1, 2015 3:11:26 AM",No,"60,834.78",0,0,Successful,0,-,stu_PDC99002_IP_ambglx24
dvmpwin040,00:01:41,"170,305","336,398",37532,Full,MYPVLXBAKCLU,ambglx21,C2_F2_AM_SHS_FTP,Backup,1,Daily_Full,"29,894.78",Yes,"Aug 1, 2015 3:00:25 AM","1,494.74","28,400.04",95,"Aug 1, 2015 3:02:06 AM",No,"29,894.78",0,0,Successful,0,-,stu_PDC99001_IP_ambglx21
dvmpwin048,00:04:57,"44,133","515,413",37535,Full,MYPVLXBAKCLU,ambglx21,C2_F2_AM_SHS_Crystal_Reports,Backup,1,Daily_Full,"145,440.72",Yes,"Aug 1, 2015 3:00:35 AM","5,817.63","139,623.09",96,"Aug 1, 2015 3:05:32 AM",No,"1
The same server has multiple entries. I need to extract the columns Job Duration, Job File Count, Throughput, and Protected Data Size, and compute the average of each column per unique server name.
Desired end state:
Client Name, Average Job Duration, Average Job File Count, Average Throughput, Average Protected Data Size
ambglx21, 00:10:00, 25000, 50000, 25000
I have only been able to figure out part of it:
import csv
from collections import defaultdict

csv_data = defaultdict(list)
for i, row in enumerate(csv.reader(open('data.csv', 'rt'))):
    if not i or not row:  # skip the header row and blank lines
        continue
    (client_name, job_duration, job_file_count, throughput, job_primary_id,
     schedule, master_server, media_server, policy_name, job_type,
     job_attempt_count, schedule_name, protected_data_size,
     accelerator_enabled, job_start_time, accelerator_data_sent,
     accelerator_savings, accelerator_optimisation, job_end_time,
     deduplication_enabled, post_deduplication_size, deduplication_savings,
     total_optimisation, job_status, status_code, policy_keyword,
     storage_unit_name) = row
    throughput = int(throughput.replace(',', ''))
    protected_data_size = float(protected_data_size.replace(',', ''))
    csv_data[client_name].append(throughput)
    #csv_data[client_name].append(job_duration)
    #csv_data[client_name].append(protected_data_size)

for client_name, throughputs in csv_data.items():
    average = int(sum(throughputs) / len(throughputs) / 1024)
    #protected_data = int(sum(protected_data) / len(protected_data) / 1024)
    print(client_name, average)
I can only get the throughput into the dictionary. I'm not sure how to append the rest of the data and process it.
Current script output:
bvmpwin017 1145
ambgjmp01 3620
ambglx22 8
Any insight would be greatly appreciated.
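One way to extend the snippet above (a sketch, not the only approach): instead of appending a single number per row, append a tuple of all four values of interest, then average column-wise per client with `zip(*rows)`. Durations like "00:12:00" are converted to seconds for averaging. The column positions and the inline sample rows below are taken from the sample data; in practice the reader would be `csv.reader(open('data.csv', newline=''))`.

```python
import csv
import io
from collections import defaultdict

def to_seconds(hms):
    """'00:12:00' -> 720.0 seconds."""
    h, m, s = (int(p) for p in hms.split(':'))
    return float(h * 3600 + m * 60 + s)

def to_number(s):
    """Strip thousands separators: '95,543' -> 95543.0."""
    return float(s.replace(',', ''))

def seconds_to_hms(sec):
    sec = int(sec)
    return '%02d:%02d:%02d' % (sec // 3600, sec % 3600 // 60, sec % 60)

def averages_per_client(reader):
    """Collect (duration, file count, throughput, size) tuples per client,
    then average each column. Column positions follow the sample data:
    0=Client Name, 1=Job Duration, 2=Job File Count,
    3=Throughput (KB/sec), 12=Protected Data Size(MB)."""
    per_client = defaultdict(list)
    for row in reader:
        if not row:
            continue
        per_client[row[0]].append((to_seconds(row[1]), to_number(row[2]),
                                   to_number(row[3]), to_number(row[12])))
    return {client: [sum(col) / len(col) for col in zip(*rows)]
            for client, rows in per_client.items()}

# Inline stand-in for the data file (no header row); only the
# columns used above carry real values here.
sample = io.StringIO(
    'ambglx21,00:03:02,0,0,x,x,x,x,x,x,x,x,0\n'
    'ambglx21,00:03:42,0,0,x,x,x,x,x,x,x,x,0\n'
)
result = averages_per_client(csv.reader(sample))
print('ambglx21, %s' % seconds_to_hms(result['ambglx21'][0]))
# -> ambglx21, 00:03:22
```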
Answer (score: 0)
I don't use csv very often, but this looked like a nice problem. I think you described your problem well but couldn't get all the data you need out of the CSV file. The times and a few other columns are not trivial to handle. I hope this gives you an approach.
By the way, the custom interpretation of the data here should probably be done through some custom use of the csv module, but I don't have experience with that and don't see how to use it. Maybe someone else can show how.
The comments in the code hopefully explain how it works.
import csv
import datetime

# tell the application how to interpret and average special column types
def get_elapsed(s):
    h, m, s = s.split(':')
    delta = datetime.timedelta(hours=int(h), minutes=int(m), seconds=int(s))
    return delta.total_seconds()

def get_int(s):
    return int(s.replace(',', ''))

def get_float(s):
    return float(s.replace(',', ''))

def numerical_average(values):
    return float(sum(values)) / max(len(values), 1)

def elapsed_average(elapsed_times_s):
    average_elapsed_s = numerical_average(elapsed_times_s)
    return datetime.timedelta(seconds=average_elapsed_s)

CONVERTER = 'converter'
AVERAGE = 'average'
HEADER_TO_TOOLS = {'Job Duration': {CONVERTER: get_elapsed,
                                    AVERAGE: elapsed_average},
                   'Job File Count': {CONVERTER: get_int,
                                      AVERAGE: numerical_average},
                   'Throughput (KB/sec)': {CONVERTER: get_float,
                                           AVERAGE: numerical_average},
                   'Protected Data Size(MB)': {CONVERTER: get_float,
                                               AVERAGE: numerical_average}}

def interpret_string(header, s):
    tools = HEADER_TO_TOOLS.get(header)
    if tools:
        return tools[CONVERTER](s)
    return s  # don't interpret if no converter exists

# collect all data as: {client_name: {header1: list_of_values, header2: list_of_values}}
data_dict = {}
with open('data.csv') as f:
    reader = csv.reader(f)
    headers = tuple(x.strip() for x in next(reader))  # first row
    for row in reader:
        client_name = row[0]
        this_client_data = data_dict.setdefault(client_name,
                                                {header: [] for header in headers})
        for header, s in zip(headers, row):
            s = s.strip()
            this_client_data[header].append(interpret_string(header, s))

# print the results
output_headers = ['**Client Name', 'Job Duration', 'Job File Count',
                  'Throughput (KB/sec)', 'Protected Data Size(MB)']
# print the headers first
print(', '.join(output_headers))
# print the client name and averages for each client
for server_name, server_data in data_dict.items():
    print_items = []
    for output_header in output_headers:
        header_values = server_data[output_header]
        tools = HEADER_TO_TOOLS.get(output_header)
        if tools:
            print_items.append(str(tools[AVERAGE](header_values)))
        else:
            print_items.append(header_values[0])  # they should all be the same if not numerical
    print(', '.join(print_items))
Results:
**Client Name, Job Duration, Job File Count, Throughput (KB/sec), Protected Data Size(MB)
ambglx21, 0:03:22, 0.0, 0.0, 0.0
dvmpwin048, 0:04:57, 44133.0, 515413.0, 145440.72
ambgsun39, 0:12:00, 0.0, 0.0, 0.0
ambgsun39, 0:11:02, 1.0, 95543.0, 60834.78
dvmpwin040, 0:01:41, 170305.0, 336398.0, 29894.78
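On the csv-module customization mentioned above: one option (my own sketch, not part of the original answer) is `csv.DictReader`, which maps each row to header names for you, so the converter table can be applied by name without tracking column positions. The inline sample below stands in for `open('data.csv')` and reuses the header names from the sample data.

```python
import csv
import io

# Converters keyed by the header names from the sample data.
CONVERTERS = {
    'Job Duration': lambda s: sum(int(p) * m for p, m in
                                  zip(s.split(':'), (3600, 60, 1))),
    'Job File Count': lambda s: int(s.replace(',', '')),
    'Throughput (KB/sec)': lambda s: float(s.replace(',', '')),
    'Protected Data Size(MB)': lambda s: float(s.replace(',', '')),
}

# Inline stand-in for open('data.csv'); only the relevant columns are shown.
# DictReader uses the first row as field names, and the csv module already
# handles the quoted values that contain commas, e.g. "95,543".
sample = io.StringIO(
    'Client Name,Job Duration,Job File Count,'
    'Throughput (KB/sec),Protected Data Size(MB)\n'
    'ambgsun39,00:12:00,0,0,0\n'
    'ambgsun39,00:11:02,1,"95,543","60,834.78"\n'
)

collected = {}  # {client: {header: list_of_converted_values}}
for row in csv.DictReader(sample):
    client = collected.setdefault(row['Client Name'],
                                  {h: [] for h in CONVERTERS})
    for header, convert in CONVERTERS.items():
        client[header].append(convert(row[header]))

for name, cols in collected.items():
    averages = {h: sum(v) / len(v) for h, v in cols.items()}
    print(name, averages)
```

Averaging the duration column in seconds and formatting it back to HH:MM:SS would work the same way as in the answer's `elapsed_average`.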