Question

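I have different datasets, some of which are at 5-minute, 15-minute, or 30-minute intervals. There are 100+ such files (in different formats: .dat, .txt, .csv, etc.). I would like to filter out the hourly data from all of these files using Python. I am new to pandas, and while I am trying to learn the library, any help would be much appreciated.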

Date        Time    Point_1
27/3/2017   0:00:00 13.08
27/3/2017   0:05:00 12.96
27/3/2017   0:10:00 13.3
27/3/2017   0:15:00 13.27
27/3/2017   0:20:00 13.15
27/3/2017   0:25:00 13.14
27/3/2017   0:30:00 13.25
27/3/2017   0:35:00 13.26
27/3/2017   0:40:00 13.24
27/3/2017   0:45:00 13.43
27/3/2017   0:50:00 13.23
27/3/2017   0:55:00 13.27
27/3/2017   1:00:00 13.19
27/3/2017   1:05:00 13.17
27/3/2017   1:10:00 13.1
27/3/2017   1:15:00 13.06
27/3/2017   1:20:00 12.99
27/3/2017   1:25:00 13.08
27/3/2017   1:30:00 13.04
27/3/2017   1:35:00 13.06
27/3/2017   1:40:00 13.07
27/3/2017   1:45:00 13.07
27/3/2017   1:50:00 13.02
27/3/2017   1:55:00 13.13
27/3/2017   2:00:00 12.99

Answer 1

You can first use the parse_dates parameter of read_csv to convert the Date and Time columns to datetime:

import pandas as pd
from io import StringIO  # pandas.compat.StringIO is no longer available in current pandas

temp = u"""Date        Time    Point_1
27/3/2017   0:00:00 13.08
27/3/2017   0:05:00 12.96
27/3/2017   0:10:00 13.3
27/3/2017   0:15:00 13.27
27/3/2017   0:20:00 13.15
27/3/2017   0:25:00 13.14
27/3/2017   0:30:00 13.25
27/3/2017   0:35:00 13.26
27/3/2017   0:40:00 13.24
27/3/2017   0:45:00 13.43
27/3/2017   0:50:00 13.23
27/3/2017   0:55:00 13.27
27/3/2017   1:00:00 13.19
27/3/2017   1:05:00 13.17
27/3/2017   1:10:00 13.1
27/3/2017   1:15:00 13.06
27/3/2017   1:20:00 12.99
27/3/2017   1:25:00 13.08
27/3/2017   1:30:00 13.04
27/3/2017   1:35:00 13.06
27/3/2017   1:40:00 13.07
27/3/2017   1:45:00 13.07
27/3/2017   1:50:00 13.02
27/3/2017   1:55:00 13.13
27/3/2017   2:00:00 12.99"""

# after testing, replace 'StringIO(temp)' with 'filename.csv'
# (recent pandas releases deprecate combining columns through parse_dates;
#  there, build the column with pd.to_datetime(df['Date'] + ' ' + df['Time']) instead)
df = pd.read_csv(StringIO(temp),
                 sep=r"\s+",  # alternatively delim_whitespace=True
                 index_col=[0],
                 parse_dates={'Dates': ['Date', 'Time']},
                 dayfirst=True)  # the sample dates are day/month/year

Then resample and aggregate with first, sum, mean, ...:

df1 = df.resample('1H')['Point_1'].first().reset_index()
print (df1)
                Dates  Point_1
0 2017-03-27 00:00:00    13.08
1 2017-03-27 01:00:00    13.19
2 2017-03-27 02:00:00    12.99

The same resample, aggregated with sum instead of first:

df1 = df.resample('1H')['Point_1'].sum().reset_index()
print (df1)
                Dates  Point_1
0 2017-03-27 00:00:00   158.58
1 2017-03-27 01:00:00   156.98
2 2017-03-27 02:00:00    12.99

Another solution, with groupby and Grouper:

df1 = df.groupby(pd.Grouper(freq='1H')).first().reset_index()
print (df1)
                Dates  Point_1
0 2017-03-27 00:00:00    13.08
1 2017-03-27 01:00:00    13.19
2 2017-03-27 02:00:00    12.99
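
One detail worth checking before choosing between these: first() returns the earliest reading inside each hourly bin, which is not necessarily a reading stamped exactly on the hour. The sketch below is an illustration only, not part of the original answer. It uses a small made-up frame in which the 1:00:00 reading is missing, and compares resampling with a plain boolean mask on the index. The "60min" rule is used here as an equivalent of the hourly alias.

```python
import pandas as pd

# Hypothetical sample: the 1:00:00 reading is left out on purpose.
stamps = pd.to_datetime(
    ["27/3/2017 0:00:00", "27/3/2017 0:05:00", "27/3/2017 0:55:00",
     "27/3/2017 1:05:00", "27/3/2017 1:10:00", "27/3/2017 2:00:00"],
    format="%d/%m/%Y %H:%M:%S",
)
df = pd.DataFrame({"Point_1": [13.08, 12.96, 13.27, 13.17, 13.10, 12.99]},
                  index=stamps)

# first() takes the earliest reading in each hourly bin,
# so the 01:00 bin is filled with the 1:05:00 value.
binned = df.resample("60min")["Point_1"].first()

# A mask keeps only the rows stamped exactly on the hour,
# so the missing 1:00:00 reading simply does not appear.
on_the_hour = df[(df.index.minute == 0) & (df.index.second == 0)]

print(binned)
print(on_the_hour)
```

If every file is guaranteed to contain a reading at each full hour, both approaches give the same rows; they differ only when readings are missing or offset.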

Answer 2

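import pandas as pd

df = pd.read_table('sample.txt', delimiter=r'\s+')  # Your sample data
# dayfirst=True because the sample dates are day/month/year
df['dt'] = pd.to_datetime(df['Date'] + ' ' + df['Time'], dayfirst=True)

# asfreq() keeps only the rows that fall exactly on each hourly timestamp
print(df.set_index('dt').resample('1H').asfreq().reset_index(drop=True))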


        Date     Time  Point_1
0  27/3/2017  0:00:00    13.08
1  27/3/2017  1:00:00    13.19
2  27/3/2017  2:00:00    12.99

Answer 3

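This is similar to what you want to do. It works for csv files and also for .txt files. If all of your data is in the same order, you can very easily write a for loop that increments a count and, when it reaches 13 (with 5-minute data, the 13th row is the next full hour), puts that value into the xaxis list. However, if your data does not follow the same pattern of 5-minute increments, you will need to sort it by another metric to save yourself a headache. This is easy to do with NumPy's sort function. https://docs.scipy.org/doc/numpy/reference/generated/numpy.sort.html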

# Opens the file, reads in the raw data, and
# cleans up the data so it is readable
with open("file_name", "r") as file:
    data = file.read()
data = data.replace(" ", ",")
# The original file ended its lines with \r; splitlines() handles
# \r, \n and \r\n, so no stray \r is left to break the conversion below
data = data.splitlines()
# Remove empty lines (built as a new list, because deleting from a
# list while looping over its indexes skips items or raises IndexError)
data = [line for line in data if len(line) > 0]
# x axis list
xaxis = []
for index in range(0, len(data)):
    print("lines", index, "-", data[index])
    data[index] = data[index].split(",")
    # Assumes the second field is an integer, as in the answer's own file
    data[index][1] = int(data[index][1])
    # Collect the first field of each line for the x axis
    xaxis.append(data[index][0])
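
The counting idea described in this answer depends on every file having the same spacing. As a minimal sketch of a variant that does not, the function below uses only the standard library and checks each timestamp instead of counting rows. It assumes the whitespace-separated "Date Time value" layout from the question; the name hourly_rows is hypothetical and introduced only for illustration.

```python
from datetime import datetime


def hourly_rows(lines):
    """Return (timestamp, value) pairs that fall exactly on the hour, sorted by time."""
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # blank or malformed line
        try:
            stamp = datetime.strptime(parts[0] + " " + parts[1],
                                      "%d/%m/%Y %H:%M:%S")
            value = float(parts[2])
        except ValueError:
            continue  # header row or unparsable field
        rows.append((stamp, value))
    rows.sort()  # sorting by timestamp makes the input order irrelevant
    return [(s, v) for s, v in rows if s.minute == 0 and s.second == 0]


sample = [
    "Date Time Point_1",
    "27/3/2017 1:00:00 13.19",
    "27/3/2017 0:05:00 12.96",
    "27/3/2017 0:00:00 13.08",
]
for stamp, value in hourly_rows(sample):
    print(stamp, value)
```

Because the filter looks at the minute and second of each timestamp, the same function works for 5-, 15- and 30-minute files without changing a counter.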

Answer 4

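Thanks all!!

Here is my complete code for reading all the files in all the folders and writing the filtered data (hourly only) to new csv files. I don't code very often, so my programming skills are not great. I am sure there is a better way to do the same thing, and I am not only talking about the pandas library but about the code as a whole. I wish I could replace my if check with something better. It is mainly there to keep the list index from going out of range (something like k = k-1, but I don't know where to put it). My code works fine. If there is a better or fancier way, please chime in!

My folder structure is as follows: Building1 is the main folder containing 20 subfolders, and each subfolder contains 19-20 files.

Cheers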

import os
import pandas as pd

folderarray = []
filearray = []

# Raw string, so the backslashes in the Windows path are not read as escapes
path = r"C:\Users\Priyanka\Documents\R_Python\OneHourInterval\Building1"
os.chdir(path)

for foldername in os.listdir(os.getcwd()):
    folderarray.append(foldername)
    print(folderarray)

# filearray[i] holds the list of file names inside folderarray[i]
for i in range(0, len(folderarray)):
    filename = os.listdir(os.path.join(path, folderarray[i]))
    filearray.append(filename)

for j in range(0, len(folderarray)):
    # Note: len(filearray) is the number of folders, not the number of files
    # in folder j. The check below keeps k in range, but a folder with more
    # files than there are folders would have its extra files skipped.
    for k in range(0, len(filearray)):
        if k < len(filearray[j]):
            filepath = os.path.join(path, folderarray[j], filearray[j][k])
            df1 = pd.read_csv(filepath, sep=",", header=None)
            df = df1[2:len(df1)]  # skip the first two rows
            df = df[[0, 1, 2, 3, 4, 5]].copy()
            df.columns = ['Date', 'Time', 'KWH', 'OCT', 'RAT', 'CO2']
            df['dt'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
            # Select several columns with a list (double brackets)
            df = df.set_index('dt').resample('1H')[['KWH', 'OCT', 'RAT', 'CO2']].first().reset_index()
            print(df)
            print(filepath)
            # Avoid naming a variable 'str', which would shadow the built-in
            newfilename = filearray[j][k].replace(".dat", ".csv")
            df.to_csv(os.path.join(path, folderarray[j], newfilename))
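
One possible way to drop the index bookkeeping and the range check is to let os.walk visit every subfolder and hand back the file names directly. The sketch below covers only the file discovery and uses only the standard library. The function name source_and_target_paths, the extension list, and the "_hourly" output suffix are all assumptions made for illustration; the per-file pandas step would stay as in the code above.

```python
import os


def source_and_target_paths(root, extensions=(".dat", ".txt", ".csv")):
    """Yield (source, target) path pairs for every data file under root."""
    for folder, _subfolders, filenames in os.walk(root):
        for name in sorted(filenames):
            stem, ext = os.path.splitext(name)
            if ext.lower() not in extensions:
                continue  # not one of the data formats
            if stem.endswith("_hourly"):
                continue  # output of an earlier run, do not process it again
            source = os.path.join(folder, name)
            # Hypothetical naming scheme: write next to the source file
            target = os.path.join(folder, stem + "_hourly.csv")
            yield source, target


# Usage sketch: replace "." with the main folder, then read each source,
# filter it to hourly rows, and write the result to the matching target.
for source, target in source_and_target_paths("."):
    print(source, "->", target)
```

Writing to a separate "_hourly.csv" name, rather than replacing ".dat" with ".csv", also keeps an input that is already a .csv file from being overwritten.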
