I just wrote Python code that extracts data from about 700 text files into a single file named out_data.txt.
The contents of out_data.txt look like this:
datetime,V_1,V_2,V_3,V_4,V_5,V_6,V_7
2013-03-17 18:01:48.372,100,884,776,009,6553,ffff,987
2013-03-17 18:02:03.828,876,632,887,008,5423,879,443
2013-05-17 20:13:52.488,543,987,233,112,098,344,123
2013-08-17 23:09:08.171,667,9887,9897,09876,0987,098,0987
2013-01-17 35:06:04.172,267,987,6897,9876,1287,3498,2987
.....
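out_data.txt contains 5783374 lines in total, and every line after the header starts with a datetime value.
My problem is that the code extracts the data from each individual file and appends it to out_data.txt, so, as shown above, the lines are not in datetime order. I want the lines in datetime order because I need to plot this data.
Any help would be much appreciated.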
import re #regular expressions
import glob #file management and reading

if __name__ == "__main__": #opening for python
    all_header=[] #list declaration
    all_values=[] #list declaration
    i=0
    with open('out_data.txt', 'w') as of: #output file
        for infile in glob.glob("/Users/name/Desktop/raw_data/*.txt"): #input file
            with open(infile) as fobj:
                print("processing file {}".format(infile))
                for line in fobj:
                    data = line.split() #split each line into individual tokens
                    if len(data)==2 and re.search(r'(\d+-\d+-\d+)', data[0]): #regular expression to identify date and time
                        header=['datetime'] #column name datetime
                        values=[data[0]+" "+data[1]] #date+time as one value
                    else:
                        header=data[0::2] #even positions: column names
                        values=data[1::2] #odd positions: column values
                    all_header.extend(header)
                    all_values.extend(values)
                    if not header: #blank line marks the end of a record
                        if i==0:
                            of.write(','.join(all_header))
                            i=i+1
                        of.write("\n")
                        of.write(','.join(all_values))
                        all_header = []
                        all_values = []
        of.write("\n")
        of.write(','.join(all_values)) #write the last record
Based on the sample data above, my expected result would be:
datetime,V_1,V_2,V_3,V_4,V_5,V_6,V_7
2013-01-17 35:06:04.172,267,987,6897,9876,1287,3498,2987
2013-03-17 18:01:48.372,100,884,776,009,6553,ffff,987
2013-03-17 18:02:03.828,876,632,887,008,5423,879,443
2013-05-17 20:13:52.488,543,987,233,112,098,344,123
2013-08-17 23:09:08.171,667,9887,9897,09876,0987,098,0987
But of course I can't really figure out how to include a sort step in my code, or whether there is another way to do this.
Thanks!
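Answer 0 (score: 0)
You can use pandas. A simple example, assuming each input file is already comma-separated with a datetime column: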
import pandas as pd
import glob
df_list = []
for infile in glob.glob("/Users/name/Desktop/raw_data/*.txt"):
    df_list.append(pd.read_csv(infile, parse_dates=['datetime']))
df = pd.concat(df_list).sort_values(by='datetime')
df.to_csv('out_data.txt',index=False)
Answer 1 (score: 0)
You can do a plain (lexicographic) sort on the first field, because the date/time is formatted with a fixed length.
Try the following:
import csv
with open("out_data.txt", "r") as f:
    reader = csv.reader(f, delimiter=",")
    header = next(reader)
    sortedlist = sorted(reader, key=lambda x: x[0])
with open("sorted.txt", "w") as f:
    writer = csv.writer(f, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(sortedlist)
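Why the string sort is enough: timestamps written as YYYY-MM-DD HH:MM:SS.mmm are zero-padded and run from the most significant field to the least, so comparing them as strings should give the same order as comparing them as dates. The small check below is only a sketch with made-up sample timestamps (the list stamps and the helper parse are illustrative names, not part of the question's code); it sorts both ways and compares the results.

```python
from datetime import datetime

# Hypothetical sample timestamps in the same layout as out_data.txt.
stamps = [
    "2013-03-17 18:02:03.828",
    "2013-08-17 23:09:08.171",
    "2013-03-17 18:01:48.372",
    "2013-05-17 20:13:52.488",
]


def parse(stamp):
    """Parse 'YYYY-MM-DD HH:MM:SS.mmm' into a datetime object."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")


by_string = sorted(stamps)                # plain lexicographic sort
by_datetime = sorted(stamps, key=parse)   # chronological sort

# Both orderings agree for this fixed-width, zero-padded layout.
print(by_string == by_datetime)
```

Note that a string sort validates nothing: a row such as the sample's 2013-01-17 35:06:04.172 sorts without complaint, whereas datetime.strptime would raise ValueError for hour 35.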
The csv snippet above is easy to embed in your own code.
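One possible way to embed it, sketched under the assumption that the extraction loop collects each record as a list of strings instead of writing it out immediately. The helper write_sorted, the file name sorted_example.txt and the sample records are hypothetical, and the code targets Python 3.

```python
import csv


def write_sorted(header, rows, path):
    """Write the header, then the rows ordered by their first field."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, lineterminator="\n")
        writer.writerow(header)
        # The first field is the fixed-width timestamp string.
        writer.writerows(sorted(rows, key=lambda row: row[0]))


# Hypothetical records, as the extraction loop might collect them.
header = ["datetime", "V_1", "V_2"]
rows = [
    ["2013-05-17 20:13:52.488", "543", "987"],
    ["2013-03-17 18:01:48.372", "100", "884"],
]
write_sorted(header, rows, "sorted_example.txt")
```

This keeps every record in memory until the end, which is the same trade-off the snippet above makes when it passes the whole reader to sorted.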
As an alternative, you can use bash.
For example:
head -n 1 out_data.txt > sorted.txt
tail -n +2 out_data.txt | sort -t, -k1,1 >> sorted.txt
Hope this helps.