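I want to use Python to subtract consecutive rows of data in a file and write the result to a new file. I tried both pandas and NumPy: the pandas version takes a very long time, even several hours, while the NumPy version takes two or three minutes. The file has more than 1 million rows in total. Here is a sample of the data: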
33,Jogging,49105962326000,-0.6946377,12.680544,0.50395286;
33,Jogging,49106062271000,5.012288,11.264028,0.95342433;
33,Jogging,49106112167000,4.903325,10.882658,-0.08172209;
My pandas code is as follows:
import pandas as pd
import numpy as np
import time

time1 = time.time()
file = open('WISDM_ar_v1.1_raw.txt','r')
dataset = file.readlines()
list1 = []
for i in range(len(dataset)-1):
    dataset[i] = dataset[i].rstrip('\n')
    dataset[i] = dataset[i].rstrip(';')
    dataset[i] = dataset[i].split(",")
    if len(dataset[i])==6:
        # list1 holds the processed data
        list1.append(dataset[i])
array1 = np.array(list1)
# newline: separator between rows; delimiter: separator between columns
np.savetxt("aa.txt", array1, fmt="%s",newline='\r\n', delimiter=",")

column_names = ['user-id', 'activity', 'timestamp', 'x-axis', 'y-axis', 'z-axis']
dataset1 = pd.read_csv('aa.txt',names=column_names, header=None)
df = pd.DataFrame(dataset1)
df1 = pd.DataFrame(columns=column_names)
for i in range(0,len(dataset1)-1):
    data = dataset1.loc[[i]]
    if dataset1.loc[i+1, 'activity']==dataset1.loc[i,'activity']:
        data.loc[i,'user-id'] = dataset1.loc[i,'user-id']
        data.loc[i,'x-axis'] = dataset1.loc[i+1,'x-axis']-dataset1.loc[i,'x-axis']
        data.loc[i,'y-axis'] = dataset1.loc[i+1,'y-axis'] - dataset1.loc[i,'y-axis']
        data.loc[i,'z-axis'] = dataset1.loc[i+1,'z-axis'] - dataset1.loc[i,'z-axis']
        df1 = df1.append(data, ignore_index=True)
df1.to_csv('new_data.txt', mode='a',sep=',', header=False, index=False)
I would like to know why this happens. Is there something wrong with the pandas code I wrote? Thank you very much!
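
For comparison, here is a minimal sketch of the same row-to-row subtraction written with whole-column operations instead of a Python loop. The likely cost in the loop above is the per-row `.loc` access together with `df1.append`, which copies the whole growing frame on every iteration. The helper names `load_rows` and `consecutive_differences` are hypothetical, the sample is just the three rows shown earlier, and keeping `user-id` and `timestamp` as strings is a simplification. Unlike the loop above, this sketch also parses the last line of the input.

```python
import pandas as pd

COLUMNS = ['user-id', 'activity', 'timestamp', 'x-axis', 'y-axis', 'z-axis']
AXES = ['x-axis', 'y-axis', 'z-axis']

# The three sample rows from the question, used as stand-in input.
SAMPLE = (
    "33,Jogging,49105962326000,-0.6946377,12.680544,0.50395286;\n"
    "33,Jogging,49106062271000,5.012288,11.264028,0.95342433;\n"
    "33,Jogging,49106112167000,4.903325,10.882658,-0.08172209;\n"
)

def load_rows(text):
    """Parse raw text, keeping only lines that have exactly six fields."""
    rows = []
    for line in text.splitlines():
        fields = line.strip().rstrip(';').split(',')
        if len(fields) == 6:
            rows.append(fields)
    df = pd.DataFrame(rows, columns=COLUMNS)
    df[AXES] = df[AXES].astype(float)  # only the axis columns become numeric
    return df

def consecutive_differences(df):
    """Return next-row minus current-row axis values, for rows whose
    successor has the same activity."""
    # True where the following row exists and has the same activity.
    same_activity = df['activity'].shift(-1) == df['activity']
    out = df.copy()
    # shift(-1) lines every row up with its successor in one operation.
    out[AXES] = df[AXES].shift(-1) - df[AXES]
    return out[same_activity].reset_index(drop=True)

result = consecutive_differences(load_rows(SAMPLE))
print(result)
```

With the real file, the text could be read once from `WISDM_ar_v1.1_raw.txt` and the result written with a single `to_csv` call, so that no DataFrame is grown row by row.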