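I have been reading several very good posts here over the last few days; unfortunately it is now my turn to ask, because I have the following problem:
I read a large dataframe (df) from a csv, with roughly 20 columns and all sorts of variable types, including float, object, string, integer and datetime. The datetime is not recognized, so I first converted the respective object column (let's call it 'pub') and then normalized it into another column, since I only need the daily level for further processing: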
df.pub = pd.to_datetime(df.pub, format='%d/%m/%Y %H:%M')
df['pub_day'] = pd.DatetimeIndex(df.pub).normalize()
df.set_index(['pub']) # indexing in df remained accurate
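This all works fine. Next, I ran several groupby operations (= countifs) on various other columns, conditional on 'pub_day'. Again, these work fine and all aggregated numbers are correct. For example: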
df['counted_if'] = df['some_no'].groupby(df['pub_day']).transform('sum')
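I do not have a continuous 'pub' or 'pub_day' column, meaning that some days are completely missing from my csv and some days occur multiple times.
Now the problem: what I want to do next is write the correctly computed groupby result as a new column into a new dataframe df2 in continuous form, which means adding rows for the dates missing from 'pub_day' and deleting rows that contain a given day for the second time. FYI: when I add a new column for the groupby operation to the first df, the groupby values are still correct and are simply repeated whenever a day in 'pub_day' occurs more than once.
I have tried many things and have also read a lot about reindex (incl. fill_value), set_index and so on, but I still cannot figure it out.
Hence, how do I: (1) export the column ['counted_if'] into a second dataframe? (2) set the day-based datetime column 'pub_day' as the df2 index? (3) delete the duplicate entries in this 1-column / 1-index df2? (4) manipulate the index so that all days appear, including the empty days, so that I end up with a discrete time series on a daily basis?
Seriously, I know all of steps (1) - (4) on their own, but somehow they only seem to work when tested independently... my combined code is messy, has many lines and gives indexing errors. Is there a quick 5-10 line solution for this?
-> data sample of df (some numbers):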
[1][2]...['some_no'][18] ['pub'] [20]['pub_day']['counted_if']
ab xy 20 abc 02/02/2002 13:03 2 02/02/2002 24
de it 4 aso 02/02/2002 11:08 32 02/02/2002 24
hi as 3 asd 01/02/2002 17:30 8 01/02/2002 3
zu lu 4 akr 31/01/2002 11:03 12 31/01/2002 5
da fu 1 lts 31/01/2002 09:03 14 31/01/2002 5
la di 6 unu 26/01/2002 08:07 3 26/01/2002 6
.. .. .. .. .......... .. .......... ..
-> what it should look like in df2:
['counted_if']
02/02/2002 24
01/02/2002 3
31/01/2002 5
30/01/2002 0 (or NaN or whatever..)
29/01/2002 0
28/01/2002 0
27/01/2002 0
26/01/2002 6
.....
df.pub = pd.to_datetime(df.pub, format='%d/%m/%Y %H:%M')
df['pub_day'] = pd.DatetimeIndex(df.pub).normalize()
df.set_index(['pub']) # indexing in df remained accurate
df['counted_if'] = df['some_no'].groupby(df['pub_day']).transform('sum')
df2=df
df2=df2.drop_duplicates(subset['pub_day'],keep='first',inplace=False)
df2=df2.drop(df2.columns[[0,1,2,..,17,21]], axis=1)
#drops all 20 columns except for df2.counted_if and df2.pub_day
##hence only 2 columns remaining here: pub_day and counted_if
df2=df2.set_index(['pup_day'])
idx=pd.date_range(min(df2['pub_day']),max(df2['pub_day']))
s = pd.Series(df2.pub_day,df2.counted_if)
s.index = pd.DatetimeIndex(s.index)
s=s.reindex(idx,fill_value=0)
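Hope this clarifies things. I have tried many different combinations. Solutions highly appreciated!
Answer 0 (score: 1)
Here is a solution with test data, which makes it easier to test: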
import pandas as pd
import io
temp=u"""1;2;some_no;18;pub;20;pub_day;counted_if
ab;xy;20;abc;02/02/2002 13:03;2;02/02/2002;24
de;it;4;aso;02/02/2002 11:08;32;02/02/2002;24
hi;as;3;asd;01/02/2002 17:30;8;01/02/2002;3
zu;lu;4;akr;31/01/2002 11:03;12;31/01/2002;5
da;fu;1;lts;31/01/2002 09:03;14;31/01/2002;5
la;di;6;unu;26/01/2002 08:07;3;26/01/2002;6"""
#after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";")
print(df)
1 2 some_no 18 pub 20 pub_day counted_if
0 ab xy 20 abc 02/02/2002 13:03 2 02/02/2002 24
1 de it 4 aso 02/02/2002 11:08 32 02/02/2002 24
2 hi as 3 asd 01/02/2002 17:30 8 01/02/2002 3
3 zu lu 4 akr 31/01/2002 11:03 12 31/01/2002 5
4 da fu 1 lts 31/01/2002 09:03 14 31/01/2002 5
5 la di 6 unu 26/01/2002 08:07 3 26/01/2002 6
df['pub'] = pd.to_datetime(df.pub, format='%d/%m/%Y %H:%M')
df['pub_day'] = pd.DatetimeIndex(df.pub).normalize()
#added inplace=True
df.set_index('pub', inplace=True) # indexing in df remained accurate
#better syntax of groupby
df['counted_if'] = df.groupby('pub_day')['some_no'].transform('sum')
print(df)
1 2 some_no 18 20 pub_day counted_if
pub
2002-02-02 13:03:00 ab xy 20 abc 2 2002-02-02 24
2002-02-02 11:08:00 de it 4 aso 32 2002-02-02 24
2002-02-01 17:30:00 hi as 3 asd 8 2002-02-01 3
2002-01-31 11:03:00 zu lu 4 akr 12 2002-01-31 5
2002-01-31 09:03:00 da fu 1 lts 14 2002-01-31 5
2002-01-26 08:07:00 la di 6 unu 3 2002-01-26 6
#omitted, not necessary
#df2=df
df2=df.drop_duplicates(subset=['pub_day'],keep='first')
#simpler: select the needed columns instead of dropping all the others
df2=df2[['counted_if','pub_day']]
print(df2)
counted_if pub_day
pub
2002-02-02 13:03:00 24 2002-02-02
2002-02-01 17:30:00 3 2002-02-01
2002-01-31 11:03:00 5 2002-01-31
2002-01-26 08:07:00 6 2002-01-26
#drops all 20 columns except for df2.counted_if and df2.pub_day
##hence only 2 columns remaining here: pub_day and counted_if
#you have to reset the index first before setting another column as the index
df2.reset_index(inplace=True)
#set column pub_day as index
df2.set_index('pub_day', inplace=True)
#pub_day is now the index, so use df2.index, not df2.pub_day
idx=pd.date_range(df2.index.min(),df2.index.max())
print(idx)
DatetimeIndex(['2002-01-26', '2002-01-27', '2002-01-28', '2002-01-29',
'2002-01-30', '2002-01-31', '2002-02-01', '2002-02-02'],
dtype='datetime64[ns]', freq='D')
#series is column counted_if
s = df2.counted_if
print(s)
pub_day
2002-02-02 24
2002-02-01 3
2002-01-31 5
2002-01-26 6
Name: counted_if, dtype: int64
#the index is already a DatetimeIndex, so this line is omitted
#s.index = pd.DatetimeIndex(s.index)
s=s.reindex(idx,fill_value=0)
print(s)
2002-01-26 6
2002-01-27 0
2002-01-28 0
2002-01-29 0
2002-01-30 0
2002-01-31 5
2002-02-01 3
2002-02-02 24
Freq: D, Name: counted_if, dtype: int64
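The steps above can also be condensed. Aggregating with `groupby(...).sum()` instead of `transform('sum')` already returns one row per day, so `drop_duplicates` becomes unnecessary. The following is only a minimal sketch, assuming the column names of the sample data (`pub`, `some_no`); the names `daily` and `full_range` are made up for illustration and do not appear in the question.

```python
import io
import pandas as pd

# reduced copy of the sample data: only the two columns needed here
temp = u"""some_no;pub
20;02/02/2002 13:03
4;02/02/2002 11:08
3;01/02/2002 17:30
4;31/01/2002 11:03
1;31/01/2002 09:03
6;26/01/2002 08:07"""

df = pd.read_csv(io.StringIO(temp), sep=";")
df['pub'] = pd.to_datetime(df['pub'], format='%d/%m/%Y %H:%M')
df['pub_day'] = df['pub'].dt.normalize()

# one row per day: aggregate instead of transform + drop_duplicates
daily = df.groupby('pub_day')['some_no'].sum()

# continuous daily index from the first to the last day (hypothetical name)
full_range = pd.date_range(daily.index.min(), daily.index.max(), freq='D')

# missing days are added with 0; to_frame gives a one-column dataframe
df2 = daily.reindex(full_range, fill_value=0).to_frame('counted_if')
print(df2)
```

If `NaN` is preferred over 0 for the empty days, leaving out `fill_value` should give that instead.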
EDIT by comment:
print(df)
1 2 some_no 18 pub 20 pub_day counted_if
0 ab xy 20 abc 02/02/2002 13:03 2 02/02/2002 24
1 de it 4 aso 02/02/2002 11:08 32 02/02/2002 24
2 hi as 3 asd 01/02/2002 17:30 8 01/02/2002 3
3 zu lu 4 akr 31/01/2002 11:03 12 31/01/2002 5
4 da fu 1 lts 31/01/2002 09:03 14 31/01/2002 5
5 la di 6 unu 26/01/2002 08:07 3 26/01/2002 6
df['pub'] = pd.to_datetime(df.pub, format='%d/%m/%Y %H:%M')
df['pub_day'] = pd.DatetimeIndex(df.pub).normalize()
df.set_index('pub', inplace=True)
#add columns pub_day (for grouping), and other columns for aggregating (counted_if, 20, ...)
df1 = df[['pub_day', 'counted_if','20']].groupby('pub_day').transform('sum').reset_index()
print(df1)
pub counted_if 20
0 2002-02-02 13:03:00 48 34
1 2002-02-02 11:08:00 48 34
2 2002-02-01 17:30:00 3 8
3 2002-01-31 11:03:00 10 26
4 2002-01-31 09:03:00 10 26
5 2002-01-26 08:07:00 6 3
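#if the date in pub_day and pub is the same, rebuild pub_day from pub
#(dt.normalize() keeps the datetime64 dtype, so the later reindex against date_range matches; dt.date would return plain Python date objects)
df1['pub_day'] = df1['pub'].dt.normalize()
print(df1)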
pub counted_if 20 pub_day
0 2002-02-02 13:03:00 48 34 2002-02-02
1 2002-02-02 11:08:00 48 34 2002-02-02
2 2002-02-01 17:30:00 3 8 2002-02-01
3 2002-01-31 11:03:00 10 26 2002-01-31
4 2002-01-31 09:03:00 10 26 2002-01-31
5 2002-01-26 08:07:00 6 3 2002-01-26
df2=df1.drop_duplicates(subset='pub_day',keep='first')
print(df2)
pub counted_if 20 pub_day
0 2002-02-02 13:03:00 48 34 2002-02-02
2 2002-02-01 17:30:00 3 8 2002-02-01
3 2002-01-31 11:03:00 10 26 2002-01-31
5 2002-01-26 08:07:00 6 3 2002-01-26
#add other columns for aggregating (counted_if, 20, ...), column pub_day is for new index
df2=df2[['counted_if','pub_day', '20']]
print(df2)
counted_if pub_day 20
0 48 2002-02-02 34
2 3 2002-02-01 8
3 10 2002-01-31 26
5 6 2002-01-26 3
df2.reset_index(inplace=True, drop=True)
df2.set_index('pub_day', inplace=True)
idx=pd.date_range(df2.index.min(),df2.index.max())
#print(idx)
df2=df2.reindex(idx,fill_value=0)
print(df2)
counted_if 20
2002-01-26 6 3
2002-01-27 0 0
2002-01-28 0 0
2002-01-29 0 0
2002-01-30 0 0
2002-01-31 10 26
2002-02-01 3 8
2002-02-02 48 34
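For several columns at once, `resample` on a datetime index may be a shorter route, since it aggregates per calendar day and emits the empty days in the same call. The sketch below is an assumption about what is wanted: it sums the raw columns `some_no` and `20` per day, so the `counted_if` values come out as 24, 3, 5, 6 rather than the 48 and 10 shown above, which result from summing the already aggregated `counted_if` column a second time. The name `daily` is illustrative only.

```python
import io
import pandas as pd

# reduced copy of the sample data: the timestamp plus two numeric columns
temp = u"""some_no;pub;20
20;02/02/2002 13:03;2
4;02/02/2002 11:08;32
3;01/02/2002 17:30;8
4;31/01/2002 11:03;12
1;31/01/2002 09:03;14
6;26/01/2002 08:07;3"""

df = pd.read_csv(io.StringIO(temp), sep=";")
df['pub'] = pd.to_datetime(df['pub'], format='%d/%m/%Y %H:%M')

# sort by timestamp, then sum every column per calendar day;
# days without any rows appear with a sum of 0
daily = (df.set_index('pub')
           .sort_index()[['some_no', '20']]
           .resample('D')
           .sum()
           .rename(columns={'some_no': 'counted_if'}))
print(daily)
```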