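How to reduce data cleansing time when processing InfluxDB data with Python

Time: 2018-07-30 06:46:45

Tags: python influxdb collectd

I am new to Python and InfluxDB, and I am currently doing some intelligent operations and maintenance work. All of the data in InfluxDB comes from collectd. I run: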

influx -execute 'SELECT * FROM "cpu_value"' -database "collectd" -precision=rfc3339 > cpu_value.csv

This gives me cpu_value.csv (about 3.7 GB), which contains the cpu_value measurements. Then I call:

import pandas as pd
df = pd.read_csv(filename, header=0, sep=r'\s+', skiprows=1, low_memory=False)
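For a file of this size, reading in chunks may help keep memory use bounded. The following is only a minimal sketch: the helper name `iter_chunks` and the tiny in-memory sample are hypothetical, and the sample layout (one line to skip, then a header line) is an assumption based on the `skiprows=1, header=0` arguments above, not the real export.

```python
import io
import pandas as pd

def iter_chunks(source, chunksize=1_000_000):
    """Return an iterator of DataFrames with at most `chunksize` rows each."""
    # Same parsing options as the call above, plus chunksize.
    return pd.read_csv(source, header=0, sep=r"\s+", skiprows=1,
                       chunksize=chunksize)

# Hypothetical sample standing in for cpu_value.csv.
sample = io.StringIO(
    "name: cpu_value\n"
    "time host value\n"
    "t1 a 1.0\n"
    "t2 a 2.0\n"
    "t3 b 3.0\n"
)

rows = 0
total = 0.0
for chunk in iter_chunks(sample, chunksize=2):
    rows += len(chunk)               # process each chunk independently
    total += chunk["value"].sum()
```

Whether chunking actually shortens the run depends on what is done with each chunk; it mainly avoids holding the full 3.7 GB file in memory at once.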

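This reads cpu_value.csv and returns the DataFrame df. I then cleanse all the records in df with a Python while loop: it reads each record, sorts it into a category, and computes the sum of the values of certain records. But df holds too many records (about 6 million), so this takes a very long time. Is there a better way to reduce the cleansing time? Thanks in advance.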

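# Note: df2, instance_num, line and df_rows are not defined in this excerpt.
# One result frame per CPU state. All names start from the same empty frame;
# DataFrame.append returns a new object, so they diverge on first use.
df_user = pd.DataFrame(columns=['datetime', 'ns', 'host', 'instance', 'type', 'type_instance', 'value'])
df_system = df_user
df_wait = df_user
df_nice = df_user
df_interrupt = df_user
df_softirq = df_user
df_steal = df_user
df_idle = df_user
loop = instance_num * 8  # rows in one full cycle of the eight CPU states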
#print('cleanse record No. %d...' % line)
print('cleanse record No. {:0,d}'.format(line))
while (line < df_rows):
    line = line + 1
    if (line % 10000) == 0:
        #print('cleanse record No. %d...' % line)
        print('cleanse record No. {:0,d}'.format(line))
    remainder = line % loop
    # Each append copies the whole frame, which is the main cost of this loop
    # (DataFrame.append was also removed in pandas 2.0).
    if 0 < remainder <= instance_num:
        df_user = df_user.append(df2.loc[line], ignore_index=True)
    elif instance_num < remainder <= instance_num * 2:
        df_system = df_system.append(df2.loc[line], ignore_index=True)
    elif instance_num * 2 < remainder <= instance_num * 3:
        df_wait = df_wait.append(df2.loc[line], ignore_index=True)
    elif instance_num * 3 < remainder <= instance_num * 4:
        df_nice = df_nice.append(df2.loc[line], ignore_index=True)
    elif instance_num * 4 < remainder <= instance_num * 5:
        df_interrupt = df_interrupt.append(df2.loc[line], ignore_index=True)
    elif instance_num * 5 < remainder <= instance_num * 6:
        df_softirq = df_softirq.append(df2.loc[line], ignore_index=True)
    elif instance_num * 6 < remainder <= instance_num * 7:
        df_steal = df_steal.append(df2.loc[line], ignore_index=True)
    elif instance_num * 7 < remainder < instance_num * 8 or remainder == 0:
        df_idle = df_idle.append(df2.loc[line], ignore_index=True)
#print('complete cleansing %d records!' % line)
print('complete cleansing {:0,d}'.format(line))

print('save classified records in csv format...')
df_user.to_csv('out/cpu_user.csv')
df_system.to_csv('out/cpu_system.csv')
df_wait.to_csv('out/cpu_wait.csv')
df_nice.to_csv('out/cpu_nice.csv')
df_interrupt.to_csv('out/cpu_interrupt.csv')
df_softirq.to_csv('out/cpu_softirq.csv')
df_steal.to_csv('out/cpu_steal.csv')
df_idle.to_csv('out/cpu_idle.csv')
#x, y = calc_instant_sum(df_user)
# Sum the 'user' values of all instances for each sample.
size = df_user.shape[0] // instance_num
df_rows = (df_user.shape[0] // instance_num) * instance_num  # drop an incomplete trailing group
line = 0
x = np.array(range(size))
y = np.zeros(size)
val = pd.to_numeric(df_user['value'])
while (line < df_rows):
    val_index = line % instance_num   # position within the current group
    y_index = line // instance_num    # index of the current group
    if val_index == 0:
        y[y_index] = val[line]
    else:
        y[y_index] = y[y_index] + val[line]
    line = line + 1
print('complete data cleansing!')
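
The row-by-row classification and the summing loop above could probably be replaced by vectorized operations. The sketch below is not a drop-in replacement: the helper names `split_by_position` and `sum_per_sample` are hypothetical, and it assumes that the rows really do arrive in repeating blocks of `instance_num` rows per category, in the fixed order user, system, wait, nice, interrupt, softirq, steal, idle, starting with the first row of the frame.

```python
import numpy as np
import pandas as pd

CATEGORIES = ["user", "system", "wait", "nice",
              "interrupt", "softirq", "steal", "idle"]

def split_by_position(df, instance_num):
    """Split rows into the eight categories by their position in the frame."""
    loop = instance_num * len(CATEGORIES)
    pos = np.arange(len(df))                 # 0-based row positions
    cat_index = (pos % loop) // instance_num  # 0..7, one value per row
    return {name: df[cat_index == i].reset_index(drop=True)
            for i, name in enumerate(CATEGORIES)}

def sum_per_sample(values, instance_num):
    """Sum each consecutive group of instance_num values."""
    values = pd.to_numeric(values).to_numpy()
    size = len(values) // instance_num       # ignore an incomplete last group
    return values[:size * instance_num].reshape(size, instance_num).sum(axis=1)

# Tiny synthetic frame: 2 instances, 2 full cycles of 8 categories.
instance_num = 2
demo = pd.DataFrame({"value": np.arange(32, dtype=float)})
parts = split_by_position(demo, instance_num)
y = sum_per_sample(parts["user"]["value"], instance_num)
```

Each category is selected with one boolean mask instead of millions of `append` calls, and the per-sample sum is a single `reshape` plus `sum`. If the `type_instance` column reliably holds the category name, grouping on that column with `df.groupby('type_instance')` may be more robust than relying on row order.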

0 answers:

No answers