Question

I have a DataFrame:

Date        Articles
2010-01-04  ((though, reliant, advertis, revenu, internet,...
2010-01-05  ((googl, expect, nexus, one, rival, iphon, hel...
2010-01-06  ((while, googl, introduc, first, piec, hardwar...
2010-01-07  ((googl, form, energi, subsidiari, appli, gove...
2010-01-08  ((david, pogu, review, googl, new, offer, nexu...
2010-01-12  ((the, compani, agre, hand, list, book, scan, ...

The dates are the index, and the articles are tuples of tuples.

I have another DataFrame:

Date        Price
2010-01-08  602.020
2010-01-15  580.000
2010-01-22  550.010
2010-01-29  529.944

Here the dates are also the index, but they are spaced one week apart.

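My problem is that I want to create another column in the second DataFrame containing all the articles for the week indicated by the index. For the first row of my second DataFrame, I want all the articles from my first DataFrame dated before 2010-01-08 (that would be the first 4 entries of my first DataFrame). Likewise, for 2010-01-15 I need all the articles from 2010-01-08 to 2010-01-14, and so on.

Any help would be appreciated. Thanks.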

Answer 1

We can use IntervalIndex.from_breaks and pd.cut

import pandas as pd

df1 = pd.DataFrame({'Articles': 
                   {pd.Timestamp('2010-01-04 00:00:00'): [0, 1],
                    pd.Timestamp('2010-01-05 00:00:00'): [2, 3],
                    pd.Timestamp('2010-01-06 00:00:00'): [4, 5],
                    pd.Timestamp('2010-01-07 00:00:00'): [6, 7],
                    pd.Timestamp('2010-01-08 00:00:00'): [8, 9],
                    pd.Timestamp('2010-01-12 00:00:00'): [10, 11]}})

            Articles
2010-01-04  [0, 1]
2010-01-05  [2, 3]
2010-01-06  [4, 5]
2010-01-07  [6, 7]
2010-01-08  [8, 9]
2010-01-12  [10, 11]

mybins = pd.IntervalIndex.from_breaks(
             pd.date_range("2010-1-1", periods=5, freq="7D"),
             closed="left"
         )

df1["bin"] = pd.cut(df1.index, bins=mybins)
df1.groupby("bin")["Articles"].sum()

bin
[2010-01-01, 2010-01-08)    [0, 1, 2, 3, 4, 5, 6, 7]
[2010-01-08, 2010-01-15)              [8, 9, 10, 11]
[2010-01-15, 2010-01-22)                        None
[2010-01-22, 2010-01-29)                        None
Name: Articles, dtype: object
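
To attach this result to the price DataFrame, one option is to build the bins from the price dates themselves. The following is a minimal sketch, assuming a df2 indexed by the weekly price dates as in the question. The extra left edge one week before the first price date, the flattening lambda, and the new column name Articles on df2 are assumptions made for illustration, not part of the answer above.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"Articles": [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11]]},
    index=pd.to_datetime(["2010-01-04", "2010-01-05", "2010-01-06",
                          "2010-01-07", "2010-01-08", "2010-01-12"]),
)
df2 = pd.DataFrame(
    {"Price": [602.020, 580.000, 550.010, 529.944]},
    index=pd.to_datetime(["2010-01-08", "2010-01-15",
                          "2010-01-22", "2010-01-29"]),
)

# Assumption: the first bin starts one week before the first price date.
breaks = df2.index.insert(0, df2.index[0] - pd.Timedelta(days=7))
bins = pd.IntervalIndex.from_breaks(breaks, closed="left")

# codes[i] is the position of the bin containing df1.index[i], or -1 if none.
codes = pd.cut(df1.index, bins=bins).codes
inside = codes >= 0

# Bin k is [breaks[k], breaks[k + 1]), and breaks[k + 1] is df2.index[k],
# so each article row can be labelled with the price date closing its bin.
week_end = df2.index[codes[inside]]
weekly = (df1.loc[inside, "Articles"]
             .groupby(week_end)
             .agg(lambda s: [x for lst in s for x in lst]))

# Weeks without articles end up as NaN.
df2["Articles"] = weekly.reindex(df2.index)
print(df2)
```

Because the bins are closed on the left, an article dated exactly 2010-01-08 is assigned to the 2010-01-15 row, which matches the rule stated in the question.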

Answer 2

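Here is a two-step solution using merge_asof with allow_exact_matches=False, so that each article row is matched with the first price whose date is strictly greater than (not equal to) the article row's date.

.agg(sum) relies on the fact that adding two tuples concatenates them into a single tuple.

Assume your DataFrames are named df and df2: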
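
The tuple behavior can be checked in plain Python. This is a small sketch with made-up tuples; note that when the built-in sum is called directly on tuples, it needs an empty tuple as its start value.

```python
# Adding two tuples concatenates them into one tuple.
a = (("googl", "expect"), ("one", "rival"))
b = (("david", "pogu"),)
combined = a + b
print(combined)

# The built-in sum starts from 0 by default, so pass () as the start value.
total = sum([a, b], ())
print(total == combined)
```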

# Test data adapted from your examples.
# Sorry that this is difficult to copy-paste into pandas

df
            Articles
2010-01-04  (though, reliant, advertis, revenu, internet)        
2010-01-05  ((googl, expect, nexus), (one, rival, iphon))        
2010-01-06  ((while, googl, introduc), (first,), (piec, hardwar))
2010-01-07  ((googl, form), (energi, subsidiari), (appli,))      
2010-01-08  ((david, pogu, review), (googl, new, offer))         
2010-01-12  ((the, compani), (agre, hand, list), (book, scan)) 

df2
            Price               
2010-01-08  602.020
2010-01-15  580.000
2010-01-22  550.010
2010-01-29  529.944


# Solution

price2articles = (pd.merge_asof(df, 
                               df2, 
                               left_index=True, 
                               right_index=True, 
                               allow_exact_matches=False,
                               direction='forward')
                .groupby('Price')
                .agg(sum))

result = pd.merge(df2, price2articles, left_on='Price', right_index=True)
# To see full contents of wide data, set
# pd.options.display.max_colwidth = 150 or higher (None for no limit; older pandas versions used -1)
result

            Articles                                                                                                                                                                                                          
2010-01-08  (though, reliant, advertis, revenu, internet, (googl, expect, nexus), (one, rival, iphon), (while, googl, introduc), (first,), (piec, hardwar), (googl, form), (energi, subsidiari), (appli,))  
2010-01-15  ((david, pogu, review), (googl, new, offer), (the, compani), (agre, hand, list), (book, scan))
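
Since the test data above is hard to paste into pandas, here is a self-contained sketch of the same two steps. The article tuples are shortened placeholders, and the lambda passed to agg is an assumption that spells out the tuple concatenation explicitly instead of relying on .agg(sum).

```python
import pandas as pd

df = pd.DataFrame(
    {"Articles": [("a", "b"), ("c",), ("d", "e"), ("f",), ("g",), ("h", "i")]},
    index=pd.to_datetime(["2010-01-04", "2010-01-05", "2010-01-06",
                          "2010-01-07", "2010-01-08", "2010-01-12"]),
)
df2 = pd.DataFrame(
    {"Price": [602.020, 580.000, 550.010, 529.944]},
    index=pd.to_datetime(["2010-01-08", "2010-01-15",
                          "2010-01-22", "2010-01-29"]),
)

# Step 1: match each article row with the first price dated strictly later.
matched = pd.merge_asof(df, df2,
                        left_index=True, right_index=True,
                        allow_exact_matches=False,
                        direction="forward")

# Step 2: concatenate the tuples of all rows that share a price,
# then attach them to the price rows.
price2articles = (matched.groupby("Price")["Articles"]
                         .agg(lambda s: sum(s, ())))
result = pd.merge(df2, price2articles, left_on="Price", right_index=True)
print(result)
```

Grouping by Price only works while every week has a distinct price; if two weeks could share a price, grouping on the matched date would be the safer key.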

Answer 3

I think you need cut to bin by the values of df2['Date'], then groupby, joining the tuples into lists:

print (df1)
        Date          Articles
0 2010-01-04  ((t, r), (s, q))
1 2010-01-07  ((g, f), (y, l))
2 2010-01-08  ((d, p), (t, o))
3 2010-01-12  ((t, c), (r, p))

b = pd.concat([df2['Date'], pd.Series(pd.to_datetime(['1970-01-01','2100-01-01']))]).sort_values()
df1['Dates'] = pd.cut(df1['Date'], bins=b, labels=b[1:], right=False)
df3 = (df1.groupby('Dates')['Articles']
          .apply(lambda x: [i for s in x for i in s])
          .iloc[:-1]
          .reset_index())
print (df3)
       Dates                          Articles
0 2010-01-08  [(t, r), (s, q), (g, f), (y, l)]
1 2010-01-15  [(d, p), (t, o), (t, c), (r, p)]
2 2010-01-22                                []
3 2010-01-29                                []

Finally, if you want to filter out the empty lists:

df3 = df3[df3['Articles'].astype(bool)]
print (df3)
       Dates                          Articles
0 2010-01-08  [(t, r), (s, q), (g, f), (y, l)]
1 2010-01-15  [(d, p), (t, o), (t, c), (r, p)]

Answer 4

Maybe this rather simple two-liner could work too (it takes advantage of the fact that 2010-01-08 is not a calendar-week boundary; the boundary falls around January 11):

m = {ind:dfx['Articles'].tolist() for ind,dfx in df1.groupby(df1.index.week)}
df2['new'] = pd.Series(df2.index.week).map(m).values

Or, if you want actual days, we can modify this code to use integer division of the calendar day:

m = {ind+1:dfx['Articles'].tolist() for ind,dfx in df1.groupby((df1.index.dayofyear-1)//7)}
df2['new'] = pd.Series(df2.index.week).map(m).values

Full example:

from io import StringIO

import pandas as pd

data1 = '''\
Date        Articles
2010-01-04  1
2010-01-05  2
2010-01-06  3
2010-01-07  4
2010-01-08  5'''

data2 = '''\
Date        Price
2010-01-08  602.020
2010-01-15  580.000
2010-01-22  550.010
2010-01-29  529.944'''

# io.StringIO replaces pd.compat.StringIO, which is no longer available in pandas.
df1 = pd.read_csv(StringIO(data1), sep=r'\s+', index_col='Date', parse_dates=['Date'])
df2 = pd.read_csv(StringIO(data2), sep=r'\s+', index_col='Date', parse_dates=['Date'])

# Map each calendar week number to the list of articles from that week.
# Note: DatetimeIndex.week was removed in pandas 2.0; recent versions
# provide the week number through index.isocalendar().week.
m = {ind:dfx['Articles'].tolist() for ind,dfx in df1.groupby(df1.index.week)}

df2['new'] = pd.Series(df2.index.week).map(m).values

df2:

              Price              new
Date                                
2010-01-08  602.020  [1, 2, 3, 4, 5]
2010-01-15  580.000              NaN
2010-01-22  550.010              NaN
2010-01-29  529.944              NaN
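
On recent pandas versions the week number is available through isocalendar(). The following is a sketch of the same week-number mapping under that assumption; the variable names weeks1 and weeks2 are illustrative, and the lookup is done with a Series rather than a dict.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"Articles": [1, 2, 3, 4, 5]},
    index=pd.to_datetime(["2010-01-04", "2010-01-05", "2010-01-06",
                          "2010-01-07", "2010-01-08"]),
)
df2 = pd.DataFrame(
    {"Price": [602.020, 580.000, 550.010, 529.944]},
    index=pd.to_datetime(["2010-01-08", "2010-01-15",
                          "2010-01-22", "2010-01-29"]),
)

# ISO week number of every date, converted to plain integers.
weeks1 = df1.index.isocalendar().week.to_numpy(dtype=int)
weeks2 = df2.index.isocalendar().week.to_numpy(dtype=int)

# Series mapping week number -> list of articles from that week.
m = df1.groupby(weeks1)["Articles"].agg(list)

# Look up each price row's week; weeks without articles become NaN.
df2["new"] = pd.Series(weeks2, index=df2.index).map(m)
print(df2)
```

Week numbers repeat every year, so this only works as written when all dates fall within a single year; for longer ranges the ISO year would need to be part of the key.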

Group a DataFrame by week
