Question

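I'm using the pandas Python library to compare two DataFrames, each consisting of a date column and two value columns. One of the DataFrames, call it LongDF, contains more dates than the other, call it ShortDF. Both DataFrames are indexed by date using a pandas.tseries.index.DatetimeIndex. See below (I've shortened them for demonstration purposes).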

LongDF

╔════════════╦════════╦════════╗
║ Date       ║ Value1 ║ Value2 ║
╠════════════╬════════╬════════╣
║ 1990-03-17 ║ 6.84   ║ 1.77   ║
║ 1990-03-18 ║ 0.99   ║ 7.00   ║
║ 1990-03-19 ║ 4.90   ║ 8.48   ║
║ 1990-03-20 ║ 2.57   ║ 2.41   ║
║ 1990-03-21 ║ 4.10   ║ 8.33   ║
║ 1990-03-22 ║ 8.86   ║ 1.31   ║
║ 1990-03-23 ║ 6.01   ║ 6.22   ║
║ 1990-03-24 ║ 0.74   ║ 1.69   ║
║ 1990-03-25 ║ 5.56   ║ 7.30   ║
║ 1990-03-26 ║ 8.05   ║ 1.67   ║
║ 1990-03-27 ║ 8.87   ║ 8.22   ║
║ 1990-03-28 ║ 9.00   ║ 6.83   ║
║ 1990-03-29 ║ 1.34   ║ 6.00   ║
║ 1990-03-30 ║ 1.69   ║ 0.40   ║
║ 1990-03-31 ║ 8.71   ║ 3.26   ║
║ 1990-04-01 ║ 4.05   ║ 4.53   ║
║ 1990-04-02 ║ 9.75   ║ 4.79   ║
║ 1990-04-03 ║ 7.74   ║ 0.44   ║
╚════════════╩════════╩════════╝

ShortDF

╔════════════╦════════╦════════╗
║ Date       ║ Value1 ║ Value2 ║
╠════════════╬════════╬════════╣
║ 1990-03-25 ║ 1.98   ║ 3.92   ║
║ 1990-03-26 ║ 3.37   ║ 3.40   ║
║ 1990-03-27 ║ 2.93   ║ 7.93   ║
║ 1990-03-28 ║ 2.35   ║ 5.34   ║
║ 1990-03-29 ║ 1.41   ║ 7.62   ║
║ 1990-03-30 ║ 9.85   ║ 3.17   ║
║ 1990-03-31 ║ 9.95   ║ 0.35   ║
║ 1990-04-01 ║ 4.42   ║ 7.11   ║
║ 1990-04-02 ║ 1.33   ║ 6.47   ║
║ 1990-04-03 ║ 6.63   ║ 1.78   ║
╚════════════╩════════╩════════╝

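What I want to do is take the data that falls on the same day in each dataset, plug the values from both sets into a formula and, if the result is greater than a certain number, paste the date and values into another DataFrame.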

I assume I should use something like for row in ShortDF.iterrows(): to iterate over each date in ShortDF, but I can't work out how to select the corresponding row in LongDF.

Any help would be greatly appreciated.

Answer 1

OK, I'm awake now, and using your data you can do this:

In [425]:
# df1 is LongDF and df2 is ShortDF; the key is to tell merge to join on both indices
merged = df1.merge(df2, left_index=True, right_index=True)
# overlapping column names get _x (left) and _y (right) suffixes, which is fine
merged
Out[425]:
            Value1_x  Value2_x  Value1_y  Value2_y
Date                                              
1990-03-25      5.56      7.30      1.98      3.92
1990-03-26      8.05      1.67      3.37      3.40
1990-03-27      8.87      8.22      2.93      7.93
1990-03-28      9.00      6.83      2.35      5.34
1990-03-29      1.34      6.00      1.41      7.62
1990-03-30      1.69      0.40      9.85      3.17
1990-03-31      8.71      3.26      9.95      0.35
1990-04-01      4.05      4.53      4.42      7.11
1990-04-02      9.75      4.79      1.33      6.47
1990-04-03      7.74      0.44      6.63      1.78

[10 rows x 4 columns]
In [432]:
# boolean indexing keeps only the rows that contain a value larger than 9,
# then max(axis=1) picks the highest value in each of those rows
merged[merged.max(axis=1) > 9].max(axis=1)
Out[432]:
Date
1990-03-30    9.85
1990-03-31    9.95
1990-04-02    9.75
dtype: float64
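
For anyone who wants to run this end to end, here is a minimal sketch that rebuilds small versions of the two frames from a few rows of the question's tables and then applies a formula to the merged result. The names long_df, short_df and score, the formula (Value2 from both frames added together) and the threshold of 10 are placeholders of mine, not something taken from the question.

```python
import pandas as pd

# A few rows copied from the question's LongDF and ShortDF tables.
long_df = pd.DataFrame(
    {"Value1": [0.74, 5.56, 8.05, 8.87], "Value2": [1.69, 7.30, 1.67, 8.22]},
    index=pd.to_datetime(["1990-03-24", "1990-03-25", "1990-03-26", "1990-03-27"]),
)
short_df = pd.DataFrame(
    {"Value1": [1.98, 3.37, 2.93], "Value2": [3.92, 3.40, 7.93]},
    index=pd.to_datetime(["1990-03-25", "1990-03-26", "1990-03-27"]),
)
long_df.index.name = short_df.index.name = "Date"

# Looking up a single date by label returns that row as a Series.
row = long_df.loc[pd.Timestamp("1990-03-26")]

# Inner join on the index keeps only the dates present in both frames.
merged = long_df.merge(short_df, left_index=True, right_index=True,
                       suffixes=("_long", "_short"))

# Placeholder formula: Value2 of both frames added together.
merged["score"] = merged["Value2_long"] + merged["Value2_short"]

# Keep only the rows where the formula exceeds the (arbitrary) threshold.
result = merged[merged["score"] > 10]
print(result)
```

Because the merge already lines the rows up by date, there is no need to loop with iterrows(); the formula is applied to whole columns at once.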

Answer 2

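OK, so sometimes I like to think of pandas DataFrames as dictionaries. That's because dictionaries are very easy to work with, and thinking of DataFrames as simple dicts often means you can find a solution to a problem without having to dig deep into pandas.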

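So in your example, I would just build a list of the common dates on which the DataFrames' values pass some test, and then create a new DataFrame that uses those dates to pull the values out of the existing DataFrames. In my example, the test is whether Value1 in DF1 plus Value1 in DF2 is greater than 10: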

import pandas as pd
import random 
random.seed(123)

#Create some data
DF1 = pd.DataFrame({'Date'      :   ['1990-03-17', '1990-03-18', '1990-03-19', 
                                     '1990-03-20', '1990-03-21', '1990-03-22', 
                                     '1990-03-23', '1990-03-24', '1990-03-25', 
                                     '1990-03-26', '1990-03-27', '1990-03-28',
                                     '1990-03-29', '1990-03-30', '1990-03-31', 
                                     '1990-04-01', '1990-04-02', '1990-04-03'],
                    'Value1'    :   [round(random.uniform(1, 10), 2) 
                                     for x in range(18)],
                    'Value2'    :   [round(random.uniform(1, 10), 2) 
                                     for x in range(18)]
                   })

DF2 = pd.DataFrame({'Date'      :   ['1990-03-25', '1990-03-26', '1990-03-27', 
                                     '1990-03-28', '1990-03-29', '1990-03-30', 
                                     '1990-03-31', '1990-04-01', '1990-04-02',  
                                     '1990-04-03'],
                    'Value1'    :   [round(random.uniform(1, 10), 2) 
                                     for x in range(10)],
                    'Value2'    :   [round(random.uniform(1, 10), 2) 
                                     for x in range(10)]
                   })

DF1.set_index('Date', inplace = True)
DF2.set_index('Date', inplace = True)

#Create a list of common dates, where the value of DF1.Value1 summed 
#with DF2.Value1 is greater than 10
Common_Set = list(DF1.index.intersection(DF2.index))
Common_Dates =  [date for date in Common_Set if 
             DF1.Value1[date] + DF2.Value1[date] > 10]

#And now create the data frame I think you want using the Common_Dates

DF_Output = pd.DataFrame({'L_Value1' : [DF1.Value1[date] for date in Common_Dates],
                          'L_Value2' : [DF1.Value2[date] for date in Common_Dates],
                          'S_Value1' : [DF2.Value1[date] for date in Common_Dates],
                          'S_Value2' : [DF2.Value2[date] for date in Common_Dates]
                         }, index = Common_Dates)

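As the comments suggest, this can definitely be done within pandas itself, but to me this is a simple solution. The Common_Dates operation could easily have been done in one line, but I split it up for clarity.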
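
For comparison, here is a rough sketch of what a more pandas-native version of the Common_Dates step might look like. It uses small hand-typed frames rather than the random data above, and the names df1, df2, common and mask are only there for illustration.

```python
import pandas as pd

# Small hand-typed stand-ins for DF1 and DF2, indexed by date strings.
df1 = pd.DataFrame({"Value1": [8.0, 2.0, 6.5, 1.0], "Value2": [1.0, 2.0, 3.0, 4.0]},
                   index=["1990-03-24", "1990-03-25", "1990-03-26", "1990-03-27"])
df2 = pd.DataFrame({"Value1": [9.0, 3.0, 9.5], "Value2": [5.0, 6.0, 7.0]},
                   index=["1990-03-25", "1990-03-26", "1990-03-27"])

# Dates present in both frames.
common = df1.index.intersection(df2.index)

# Same test as in the list comprehension, evaluated for all common dates at once.
mask = (df1.loc[common, "Value1"] + df2.loc[common, "Value1"]) > 10

# Keep only the dates where the test passed.
common_dates = list(common[mask.to_numpy()])
print(common_dates)
```

The boolean mask plays the role of the if clause in the list comprehension, so the result can be used exactly like Common_Dates.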

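Of course, if both DataFrames have a lot of columns, writing out the DF_Output DataFrame constructor by hand can get quite painful. If that's the case, you can do this instead: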

DF1_Out = {'L' + col : [DF1[col][date] for date in Common_Dates] 
            for col in DF1.columns}
DF2_Out = {'S' + col : [DF2[col][date] for date in Common_Dates] 
            for col in DF2.columns}

DF_Out = {}
DF_Out.update(DF1_Out)
DF_Out.update(DF2_Out)

DF_Output2 = pd.DataFrame(DF_Out, index = Common_Dates)

Both approaches give me the same data (shown here with the column names produced by the second approach):

            LValue1  LValue2  SValue1  SValue2
1990-03-25     8.67     6.16     3.84     4.37
1990-03-27     4.03     8.54     7.92     7.79
1990-03-29     3.21     4.09     7.16     8.38
1990-03-31     4.93     2.86     7.00     6.92
1990-04-01     1.79     6.48     9.01     2.53
1990-04-02     6.38     5.74     5.38     4.03
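
For the many-columns case, another option is to let pandas do the renaming with add_prefix and glue the two selections together with pd.concat. This is a variation I have not run against the random data above, so treat it as a sketch; the tiny frames and the common_dates list below are made up for illustration.

```python
import pandas as pd

# Made-up stand-ins for DF1, DF2 and Common_Dates.
df1 = pd.DataFrame({"Value1": [8.0, 2.0, 1.0], "Value2": [1.0, 2.0, 4.0]},
                   index=["1990-03-24", "1990-03-25", "1990-03-27"])
df2 = pd.DataFrame({"Value1": [9.0, 9.5], "Value2": [5.0, 7.0]},
                   index=["1990-03-25", "1990-03-27"])
common_dates = ["1990-03-25", "1990-03-27"]

# Select the common dates from each frame, prefix the column names,
# and place the two selections side by side.
out = pd.concat([df1.loc[common_dates].add_prefix("L"),
                 df2.loc[common_dates].add_prefix("S")], axis=1)
print(out)
```

This keeps the column list out of the code entirely, so it should keep working if columns are added to either frame.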

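I imagine this won't satisfy a lot of people, but it's how I would tackle it. P.S. In follow-up questions, it would be nice if you could do the legwork of creating the DataFrames yourself.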

Selecting rows using a DatetimeIndex
