Match and compute two files by time and column

Time: 2015-10-04 01:31:38

Tags: python datetime pandas dataframe

I am trying to work through these two CSV files by looking up rows with pandas:

File1:

    ---------------------------------------------------------------
    Day  Mth  Yr    Hr  Min  Loc_Nu  Lat       Long      Rain
    ---------------------------------------------------------------
    1    1    2005  9   30   12456   -34.9211  138.6216  Yes
    1    1    2005  9   45   12375   -34.9211  138.6216  Yes
    1    12   1998  17  5    12376   -34.9211  138.6216  No

File2:

    ----------------------------------------------------------------------
    date              12375    12376    12456
    ----------------------------------------------------------------------
    1/1/2005 9:30     NA       NA       0.2
    1/1/2005 10:00    NA       0        NA
    1/1/2005 10:30    0        NA       0.6
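
I want to:

  1. Write a new file containing the rows where the Loc_Nu and time in file1 match a column header and time in file2.
  2. Count how many of these matches are NA, 0, and >0.

This is my script so far: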

    import pandas as pd

    file1 = pd.read_csv(r'E:\project\test\file1.csv')
    print(file1)
    file2 = pd.read_csv(r'E:\project\test\file2.csv')
    print(file2)
    

I have to point to the directory explicitly. Without it, I cannot print file1 and file2.
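
The two `read_csv` calls in the script repeat the same directory. One possible way to state it only once is to build both paths from a single base directory. This is a minimal sketch, assuming the files stay under `E:\project\test` as in the script; `DATA_DIR` and `csv_path` are hypothetical names introduced here for illustration.

```python
from pathlib import Path

# Assumption: both CSV files live in the directory used in the script.
DATA_DIR = Path(r"E:\project\test")


def csv_path(name):
    """Return the full path of a file inside DATA_DIR (hypothetical helper)."""
    return DATA_DIR / name


file1_path = csv_path("file1.csv")
file2_path = csv_path("file2.csv")
print(file1_path.name, file2_path.name)
```

The resulting path objects can be passed straight to `pd.read_csv`, for example `pd.read_csv(csv_path("file1.csv"))`.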

1 answer:

Answer 0 (score: 0)

You can try this solution; if something is unclear, ask in the comments:

    import datetime
    import pandas as pd

    df = pd.read_csv(r'E:\project\test\file1.csv')
    df1 = pd.read_csv(r'E:\project\test\file2.csv')

    # convert the date column to datetime
    df1["date"] = pd.to_datetime(df1["date"], format="%d/%m/%Y %H:%M")
    # set date as the index, stack the location columns into rows (keeping NaN values), reset the index
    df1 = df1.set_index("date").stack(dropna=False).reset_index()
    # set the column names
    df1.columns = ['date', 'Loc_Nu', 'values']
    # convert Loc_Nu to int so it can be merged with file1
    df1['Loc_Nu'] = df1['Loc_Nu'].astype(int)

    # build a datetime column from the component columns, then delete those columns
    df['date'] = df[['Yr', 'Mth', 'Day', 'Hr', 'Min']].apply(lambda s: datetime.datetime(*s), axis=1)
    df = df.drop(['Yr', 'Mth', 'Day', 'Hr', 'Min'], axis=1)
    print(df)
    #   Loc_Nu      Lat      Long Rain                date
    #0   12456 -34.9211  138.6216  Yes 2005-01-01 09:30:00
    #1   12375 -34.9211  138.6216  Yes 2005-01-01 09:45:00
    #2   12376 -34.9211  138.6216   No 1998-12-01 17:05:00
    print(df1)
    #                 date  Loc_Nu  values
    #0 2005-01-01 09:30:00   12375     NaN
    #1 2005-01-01 09:30:00   12376     NaN
    #2 2005-01-01 09:30:00   12456     0.2
    #3 2005-01-01 10:00:00   12375     NaN
    #4 2005-01-01 10:00:00   12376     0.0
    #5 2005-01-01 10:00:00   12456     NaN
    #6 2005-01-01 10:30:00   12375     0.0
    #7 2005-01-01 10:30:00   12376     NaN
    #8 2005-01-01 10:30:00   12456     0.6

    # intersection of df and df1 on the columns date and Loc_Nu
    df2 = pd.merge(df, df1, on=['date', 'Loc_Nu'])
    # optionally reorder the columns
    df2 = df2[['date', 'Loc_Nu', 'Lat', 'Long', 'Rain', 'values']]
    print(df2)
    #                 date  Loc_Nu      Lat      Long Rain  values
    #0 2005-01-01 09:30:00   12456 -34.9211  138.6216  Yes     0.2

    # show the matching rows and count them for values 0, >0 and NaN
    print(df2.loc[df2['values'] == 0])
    print(len(df2.loc[df2['values'] == 0].index))
    print(df2.loc[df2['values'] > 0])
    print(len(df2.loc[df2['values'] > 0].index))
    print(df2.loc[df2['values'].isnull()])
    print(len(df2.loc[df2['values'].isnull()].index))

    #Empty DataFrame
    #Columns: [date, Loc_Nu, Lat, Long, Rain, values]
    #Index: []
    #0
    #                 date  Loc_Nu      Lat      Long Rain  values
    #0 2005-01-01 09:30:00   12456 -34.9211  138.6216  Yes     0.2
    #1
    #Empty DataFrame
    #Columns: [date, Loc_Nu, Lat, Long, Rain, values]
    #Index: []
    #0
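
The solution above never writes the new file that the question asks for, and it depends on files on disk. The following is a self-contained sketch of the same pipeline that can be run without them. It makes several assumptions: the sample rows from the question are inlined as comma-separated text, `melt` is swapped in for `stack(dropna=False)`, `pd.to_datetime` assembles the timestamp from renamed component columns instead of a row-wise `apply`, and the names `long_df`, `matches`, `counts` and `buffer` are introduced here for illustration only.

```python
import io

import pandas as pd

# Sample data copied from the question (assumed to be comma-separated on disk).
FILE1 = """Day,Mth,Yr,Hr,Min,Loc_Nu,Lat,Long,Rain
1,1,2005,9,30,12456,-34.9211,138.6216,Yes
1,1,2005,9,45,12375,-34.9211,138.6216,Yes
1,12,1998,17,5,12376,-34.9211,138.6216,No
"""
FILE2 = """date,12375,12376,12456
1/1/2005 9:30,NA,NA,0.2
1/1/2005 10:00,NA,0,NA
1/1/2005 10:30,0,NA,0.6
"""

df = pd.read_csv(io.StringIO(FILE1))
df1 = pd.read_csv(io.StringIO(FILE2))

# Assemble one timestamp per row from the component columns.
parts = df[['Yr', 'Mth', 'Day', 'Hr', 'Min']].rename(
    columns={'Yr': 'year', 'Mth': 'month', 'Day': 'day',
             'Hr': 'hour', 'Min': 'minute'})
df['date'] = pd.to_datetime(parts)
df = df.drop(columns=['Yr', 'Mth', 'Day', 'Hr', 'Min'])

# Wide to long: one row per (date, Loc_Nu) pair; melt keeps the NaN cells.
df1['date'] = pd.to_datetime(df1['date'], format='%d/%m/%Y %H:%M')
long_df = df1.melt(id_vars='date', var_name='Loc_Nu', value_name='values')
long_df['Loc_Nu'] = long_df['Loc_Nu'].astype(int)

# Keep only the rows whose date and Loc_Nu appear in both files.
matches = pd.merge(df, long_df, on=['date', 'Loc_Nu'])

# Count the matches by category.
counts = {
    'NA': int(matches['values'].isna().sum()),
    '0': int((matches['values'] == 0).sum()),
    '>0': int((matches['values'] > 0).sum()),
}
print(counts)  # {'NA': 0, '0': 0, '>0': 1}

# Write the matches out; an in-memory buffer stands in for the new file here.
buffer = io.StringIO()
matches.to_csv(buffer, index=False)
print(buffer.getvalue())
```

To produce an actual file, pass a path of your choice to `to_csv` in place of `buffer`. Note that `melt` orders the long table by location column and then by date, so intermediate row order differs from the `stack` output shown above, while the merged result is the same single row.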