Grouping a DataFrame by week

Time: 2018-04-01 16:01:26

Tags: python pandas numpy dataframe

I have a DataFrame:

Date        Articles
2010-01-04  ((though, reliant, advertis, revenu, internet,...
2010-01-05  ((googl, expect, nexus, one, rival, iphon, hel...
2010-01-06  ((while, googl, introduc, first, piec, hardwar...
2010-01-07  ((googl, form, energi, subsidiari, appli, gove...
2010-01-08  ((david, pogu, review, googl, new, offer, nexu...
2010-01-12  ((the, compani, agre, hand, list, book, scan, ...

Date is the index, and Articles holds tuples of tuples.

I have another DataFrame:

Date        Price
2010-01-08  602.020
2010-01-15  580.000
2010-01-22  550.010
2010-01-29  529.944

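Here Date is also the index, but the rows are spaced one week apart.

I want to create another column in the second DataFrame that contains all the articles belonging to the week indicated by its index. For the first row of the second DataFrame, I want all articles from the first DataFrame dated before 2010-01-08 (the first 4 entries of the first DataFrame). Likewise, for 2010-01-15 I need all articles from 2010-01-08 through 2010-01-14, and so on.

Any help would be greatly appreciated. Thanks.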

4 answers:

Answer 0: (score: 1)

We can use IntervalIndex.from_breaks together with pd.cut:

import pandas as pd

df1 = pd.DataFrame({'Articles': 
                   {pd.Timestamp('2010-01-04 00:00:00'): [0, 1],
                    pd.Timestamp('2010-01-05 00:00:00'): [2, 3],
                    pd.Timestamp('2010-01-06 00:00:00'): [4, 5],
                    pd.Timestamp('2010-01-07 00:00:00'): [6, 7],
                    pd.Timestamp('2010-01-08 00:00:00'): [8, 9],
                    pd.Timestamp('2010-01-12 00:00:00'): [10, 11]}})

            Articles
2010-01-04  [0, 1]
2010-01-05  [2, 3]
2010-01-06  [4, 5]
2010-01-07  [6, 7]
2010-01-08  [8, 9]
2010-01-12  [10, 11]

mybins = pd.IntervalIndex.from_breaks(
             pd.date_range("2010-1-1", periods=5, freq="7D"),
             closed="left"
         )

df1["bin"] = pd.cut(df1.index, bins=mybins)
df1.groupby("bin")["Articles"].sum()

bin
[2010-01-01, 2010-01-08)    [0, 1, 2, 3, 4, 5, 6, 7]
[2010-01-08, 2010-01-15)              [8, 9, 10, 11]
[2010-01-15, 2010-01-22)                        None
[2010-01-22, 2010-01-29)                        None
Name: Articles, dtype: object
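
The grouping above stops short of writing the result back into the second DataFrame. A minimal sketch of that last step follows, assuming the same sample values as in this answer. The column name week_end is a hypothetical helper that is not in the original, and the lists are flattened with a plain Python comprehension instead of .sum().

```python
import pandas as pd

df1 = pd.DataFrame(
    {"Articles": [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11]]},
    index=pd.to_datetime(["2010-01-04", "2010-01-05", "2010-01-06",
                          "2010-01-07", "2010-01-08", "2010-01-12"]),
)
df2 = pd.DataFrame(
    {"Price": [602.020, 580.000, 550.010, 529.944]},
    index=pd.to_datetime(["2010-01-08", "2010-01-15", "2010-01-22", "2010-01-29"]),
)

# Left-closed weekly bins: [2010-01-01, 2010-01-08), [2010-01-08, 2010-01-15), ...
bins = pd.IntervalIndex.from_breaks(
    pd.date_range("2010-01-01", periods=5, freq="7D"), closed="left"
)

# Integer position of the bin holding each article date.
# (-1 would mean "outside all bins"; that does not occur in this sample.)
codes = pd.cut(df1.index, bins=bins).codes

# Hypothetical helper column: the right edge of each bin, which is the df2 date.
df1["week_end"] = bins.right[codes]

# Flatten the per-day lists of each week into one list.
weekly = df1.groupby("week_end")["Articles"].apply(
    lambda s: [item for lst in s for item in lst]
)

# Assignment aligns on the date index; weeks without articles become NaN.
df2["Articles"] = weekly
print(df2)
```

Because the right edge of each left-closed bin coincides with a date in df2, plain index alignment is enough and no explicit merge is needed.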

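Answer 1: (score: 0)

Here is a two-step solution using merge_asof with allow_exact_matches=False, so that each article row is matched with the first price whose date is strictly greater than (not equal to) the article row's date.

.agg(sum) relies on the fact that adding two tuples concatenates them into a single tuple.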
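
As a quick illustration of that tuple behaviour in plain Python (the sample tuples are made up for this sketch, and no pandas is involved):

```python
# Adding two tuples concatenates them into one longer tuple.
a = (("googl", "form"), ("energi",))
b = (("david", "pogu"),)
combined = a + b
print(combined)

# The built-in sum starts from 0 by default, so an empty tuple must be
# passed as the start value when summing a plain list of tuples.
total = sum([a, b], ())
print(total)
```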

Assume your DataFrames are named df and df2.

# Test data adapted from your examples.
# Sorry that this is difficult to copy-paste into pandas

df
            Articles
2010-01-04  (though, reliant, advertis, revenu, internet)        
2010-01-05  ((googl, expect, nexus), (one, rival, iphon))        
2010-01-06  ((while, googl, introduc), (first,), (piec, hardwar))
2010-01-07  ((googl, form), (energi, subsidiari), (appli,))      
2010-01-08  ((david, pogu, review), (googl, new, offer))         
2010-01-12  ((the, compani), (agre, hand, list), (book, scan)) 

df2
            Price               
2010-01-08  602.020
2010-01-15  580.000
2010-01-22  550.010
2010-01-29  529.944


# Solution

price2articles = (pd.merge_asof(df, 
                               df2, 
                               left_index=True, 
                               right_index=True, 
                               allow_exact_matches=False,
                               direction='forward')
                .groupby('Price')
                .agg(sum))

result = pd.merge(df2, price2articles, left_on='Price', right_index=True)
# To see full contents of wide data, set
# pd.options.display.max_colwidth = 150 or higher (-1 for no limit)
result

            Articles                                                                                                                                                                                                          
2010-01-08  (though, reliant, advertis, revenu, internet, (googl, expect, nexus), (one, rival, iphon), (while, googl, introduc), (first,), (piec, hardwar), (googl, form), (energi, subsidiari), (appli,))  
2010-01-15  ((david, pogu, review), (googl, new, offer), (the, compani), (agre, hand, list), (book, scan))
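
Since the test data above is hard to paste into pandas, here is a self-contained sketch of the same merge_asof idea with made-up, shortened tuples. It is a variation and not the answer's exact code: it groups on the matched date (a hypothetical helper column named week_end) instead of on Price, which avoids collisions if two weeks happen to share a price.

```python
import pandas as pd

df = pd.DataFrame(
    {"Articles": [(("a", "b"),), (("c",), ("d", "e")), (("f",),), (("g", "h"),)]},
    index=pd.to_datetime(["2010-01-04", "2010-01-07", "2010-01-08", "2010-01-12"]),
)
df2 = pd.DataFrame(
    {"Price": [602.020, 580.000, 550.010, 529.944]},
    index=pd.to_datetime(["2010-01-08", "2010-01-15", "2010-01-22", "2010-01-29"]),
)

# Match every article row with the first price date strictly after it.
# week_end copies the price date into a column so it survives the merge.
merged = pd.merge_asof(
    df,
    df2.assign(week_end=df2.index),
    left_index=True,
    right_index=True,
    allow_exact_matches=False,
    direction="forward",
)

# Concatenate the tuples of each week; () is the start value for sum.
weekly = {
    week_end: sum(group, ())
    for week_end, group in merged.groupby("week_end")["Articles"]
}

# Building a Series from the dict keeps each tuple as a single cell;
# assignment then aligns on the date index (missing weeks become NaN).
df2["Articles"] = pd.Series(weekly)
print(df2)
```

The article dated 2010-01-08 lands in the 2010-01-15 row because exact matches are excluded.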

Answer 2: (score: 0)

I think you need cut with bins taken from the values of df2['Date'], followed by groupby, flattening the tuples into lists:

print (df1)
        Date          Articles
0 2010-01-04  ((t, r), (s, q))
1 2010-01-07  ((g, f), (y, l))
2 2010-01-08  ((d, p), (t, o))
3 2010-01-12  ((t, c), (r, p))

b = pd.concat([df2['Date'], pd.Series(pd.to_datetime(['1970-01-01','2100-01-01']))]).sort_values()
df1['Dates'] = pd.cut(df1['Date'], bins=b, labels=b[1:], right=False)

df3 = (df1.groupby('Dates')['Articles']
          .apply(lambda x: [i for s in x for i in s])
          .iloc[:-1]
          .reset_index())
print (df3)
       Dates                          Articles
0 2010-01-08  [(t, r), (s, q), (g, f), (y, l)]
1 2010-01-15  [(d, p), (t, o), (t, c), (r, p)]
2 2010-01-22                                []
3 2010-01-29                                []

Finally, if you want to filter out the empty lists:

df3 = df3[df3['Articles'].astype(bool)]
print (df3)
       Dates                          Articles
0 2010-01-08  [(t, r), (s, q), (g, f), (y, l)]
1 2010-01-15  [(d, p), (t, o), (t, c), (r, p)]

Answer 3: (score: 0)

Maybe this fairly simple two-liner would also work. (It relies on the fact that the calendar week does not break at 2010-01-08 but around January 11.)

m = {ind:dfx['Articles'].tolist() for ind,dfx in df1.groupby(df1.index.week)}
df2['new'] = pd.Series(df2.index.week).map(m).values

If you want actual 7-day periods instead, the code can be modified to use floor division of the calendar day (day of year):

m = {ind+1:dfx['Articles'].tolist() for ind,dfx in df1.groupby((df1.index.dayofyear-1)//7)}
df2['new'] = pd.Series(df2.index.week).map(m).values

Full example:

from io import StringIO

import pandas as pd

data1 = '''\
Date        Articles
2010-01-04  1
2010-01-05  2
2010-01-06  3
2010-01-07  4
2010-01-08  5'''

data2 = '''\
Date        Price
2010-01-08  602.020
2010-01-15  580.000
2010-01-22  550.010
2010-01-29  529.944'''

df1 = pd.read_csv(StringIO(data1), sep=r'\s+', index_col='Date', parse_dates=['Date'])
df2 = pd.read_csv(StringIO(data2), sep=r'\s+', index_col='Date', parse_dates=['Date'])

# Note: DatetimeIndex.week was removed in pandas 2.0;
# df1.index.isocalendar().week is its replacement there.
m = {ind:dfx['Articles'].tolist() for ind,dfx in df1.groupby(df1.index.week)}

df2['new'] = pd.Series(df2.index.week).map(m).values

df2:

              Price              new
Date                                
2010-01-08  602.020  [1, 2, 3, 4, 5]
2010-01-15  580.000              NaN
2010-01-22  550.010              NaN
2010-01-29  529.944              NaN
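
The difference between the two groupings in this answer can be checked directly. The sketch below compares the ISO calendar week with the floor-division block number for a few January 2010 dates. It assumes pandas 1.1 or later, where isocalendar() is available; the variable names are made up for this illustration.

```python
import pandas as pd

dates = pd.to_datetime(
    ["2010-01-04", "2010-01-07", "2010-01-08", "2010-01-12", "2010-01-15"]
)

# ISO calendar week (Monday to Sunday); week 1 of 2010 runs Jan 4 to Jan 10.
iso_week = dates.isocalendar().week.astype(int).tolist()

# Fixed 7-day blocks counted from January 1: days 1-7 -> 1, days 8-14 -> 2, ...
block = ((dates.dayofyear - 1) // 7 + 1).tolist()

print(iso_week)  # [1, 1, 1, 2, 2]
print(block)     # [1, 1, 2, 2, 3]
```

With ISO weeks, 2010-01-08 falls in the same group as 2010-01-04, which is why the full example puts all five articles in the first row. With 7-day blocks, 2010-01-08 starts block 2, and because the ISO week of each price date (1, 2, 3, 4) equals the number of the block that ends the day before it, mapping df2's week numbers onto the block keys gives the grouping asked for in the question for this particular sample.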