I'm new to Python and InfluxDB, and I'm currently doing some intelligent operations and maintenance work. All the data in InfluxDB is collected by collectd. I run:
influx -execute 'SELECT * FROM "cpu_value"' -database "collectd" -precision=rfc3339 > cpu_value.csv
This gives me cpu_value.csv (about 3.7 GB), which holds the cpu_value measurements. Then I call:
df = pd.read_csv(filename, header=0, sep=r'\s+', skiprows=1, low_memory=False)
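to read cpu_value.csv into the DataFrame df. I then cleanse the records in df with a Python while loop that reads every record, sorts it into one of eight categories, and sums the values of certain records. However, df contains too many records (about 6 million), and the loop takes a very long time. Is there a better way to reduce the cleansing time? Thanks in advance.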
df_user = pd.DataFrame(columns=['datetime', 'ns', 'host', 'instance', 'type', 'type_instance', 'value'])
df_system = df_user
df_wait = df_user
df_nice = df_user
df_interrupt = df_user
df_softirq = df_user
df_steal = df_user
df_idle = df_user
loop = instance_num * 8
#print('cleanse record No. %d...' % line)
print('cleanse record No. {:0,d}'.format(line))
while (line < df_rows):
    line = line + 1
    if (line % 10000) == 0:
        #print('cleanse record No. %d...' % line)
        print('cleanse record No. {:0,d}'.format(line))
    remainder = line % loop
    if (0 < remainder) and (remainder <= instance_num):
        df_user = df_user.append(df2.loc[line], ignore_index=True)
    elif (instance_num <= remainder) and (remainder <= instance_num*2):
        df_system = df_system.append(df2.loc[line], ignore_index=True)
    elif (instance_num*2 <= remainder) and (remainder <= instance_num*3):
        df_wait = df_wait.append(df2.loc[line], ignore_index=True)
    elif (instance_num*3 <= remainder) and (remainder <= instance_num*4):
        df_nice = df_nice.append(df2.loc[line], ignore_index=True)
    elif (instance_num*4 <= remainder) and (remainder <= instance_num*5):
        df_interrupt = df_interrupt.append(df2.loc[line], ignore_index=True)
    elif (instance_num*5 <= remainder) and (remainder <= instance_num*6):
        df_softirq = df_softirq.append(df2.loc[line], ignore_index=True)
    elif (instance_num*6 <= remainder) and (remainder <= instance_num*7):
        df_steal = df_steal.append(df2.loc[line], ignore_index=True)
    elif ((instance_num*7 <= remainder) and (remainder < instance_num*8)) or (remainder == 0):
        df_idle = df_idle.append(df2.loc[line], ignore_index=True)
#print('complete cleansing %d records!' % line)
print('complete cleansing {:0,d}'.format(line))
print('save classified records in csv format...')
df_user.to_csv('out/cpu_user.csv')
df_system.to_csv('out/cpu_system.csv')
df_wait.to_csv('out/cpu_wait.csv')
df_nice.to_csv('out/cpu_nice.csv')
df_interrupt.to_csv('out/cpu_interrupt.csv')
df_softirq.to_csv('out/cpu_softirq.csv')
df_steal.to_csv('out/cpu_steal.csv')
df_idle.to_csv('out/cpu_idle.csv')
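Most of the time in the loop above probably goes into `DataFrame.append`, which copies the whole accumulated frame on every call (and was removed in pandas 2.0). Below is a minimal sketch of the same bucketing done with a single array computation. It assumes the rows really do arrive in repeating blocks of `instance_num` rows per CPU state, counted from the first row of the frame; `split_by_position` and `STATES` are hypothetical names, not part of the original script.

```python
import numpy as np
import pandas as pd

# Order of the CPU states, taken from the order of the branches in the loop.
STATES = ['user', 'system', 'wait', 'nice',
          'interrupt', 'softirq', 'steal', 'idle']


def split_by_position(df, instance_num):
    """Split df into one frame per CPU state using only the row position.

    Assumption: rows come in blocks of instance_num rows per state, and the
    eight states repeat in the order given by STATES.
    """
    pos = np.arange(len(df))                # 0-based row position
    period = instance_num * len(STATES)     # same as `loop` in the question
    bucket = (pos % period) // instance_num # 0 = user, ..., 7 = idle
    # One boolean mask per state instead of one append per row.
    return {name: df[bucket == i].reset_index(drop=True)
            for i, name in enumerate(STATES)}
```

Each resulting frame can then be written out with `to_csv` as before. If the `type_instance` column already holds the state name, `df.groupby('type_instance')` might be a more robust way to split, since it would not depend on row order.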
#x, y = calc_instant_sum(df_user)
size = df_user.shape[0] // instance_num
df_rows = (df_user.shape[0] // instance_num) * instance_num
line = 0
x = np.array(range(size))
y = np.zeros(size)
val = pd.to_numeric(df_user['value'])
while (line < df_rows):
    val_index = line % instance_num
    y_index = line // instance_num
    if val_index == 0:
        y[y_index] = val[line]
    else:
        y[y_index] = y[y_index] + val[line]
    line = line + 1
print('complete data cleansing!')
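The summation loop above adds up each consecutive group of `instance_num` values. Here is a sketch of the same calculation with a NumPy reshape, assuming (as the loop does) that every group of `instance_num` consecutive rows belongs to one sample and that leftover rows at the end are ignored. `instant_sum_sketch` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def instant_sum_sketch(values, instance_num):
    """Sum every consecutive group of instance_num values.

    Returns (x, y): x is the sample index 0..size-1 and y[k] is the sum of
    values[k*instance_num : (k+1)*instance_num].
    """
    val = np.asarray(pd.to_numeric(values))
    size = len(val) // instance_num         # number of complete groups
    # Drop the incomplete tail, view the rest as (size, instance_num), sum rows.
    y = val[:size * instance_num].reshape(size, instance_num).sum(axis=1)
    x = np.arange(size)
    return x, y
```

With this, `x, y = instant_sum_sketch(df_user['value'], instance_num)` would replace the second while loop without iterating over rows in Python.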