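I'm not sure whether this calls for a merge or a join.
I have a dataframe, DataFrame A, with the columns ['times','spots'], and another dataframe, DataFrame B, with similar columns ['times','spots'].
I want to change DataFrame A so that it also holds the spots values from DataFrame B at A's times, so that DataFrame A gets a new column, spot_B.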
Answer 0 (score: 2)
Well, I'm going to go out on a limb and show you how to merge with suffixes:
# First import our libraries
>>> import pandas as pd
>>> import numpy as np
# Then create our dataframes
>>> df_A = pd.DataFrame(np.random.rand(3,2),columns=['times','spots'])
>>> df_B = pd.DataFrame(np.random.rand(3,2),columns=['times','spots'])
# Set default values
>>> df_A['times'] = [1,2,3]
>>> df_B['times'] = [1,2,3]
>>> df_A['spots'] = [44,55,66]
>>> df_B['spots'] = [77,88,99]
# Here is what both dataframes contain
>>> df_A
times spots
0 1 44
1 2 55
2 3 66
>>> df_B
times spots
0 1 77
1 2 88
2 3 99
# Now the merge -- note: this does not affect the first dataframe in place.
# It will create a new dataframe. You can overwrite the
# first if you set the result to df_A instead of df_merged.
# Note the use of the keyword suffixes. When the same column names exist in both
# dataframes (and are not being merged on), pandas needs to differentiate between
# them. By default, '_x' is appended to the left dataframe's column name and '_y'
# to the right dataframe's column name
# [order is set by the first two arguments in the merge function].
# The suffixes keyword allows the user to override this behaviour with their
# own versions of '_x' and '_y'.
>>> df_merged = pd.merge(df_A,df_B,how='inner',on=['times'],suffixes=['_A','_B'])
>>> df_merged
times spots_A spots_B
0 1 44 77
1 2 55 88
2 3 66 99
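For comparison, here is a minimal, self-contained sketch of what happens when suffixes is left out. It rebuilds the same small frames from dictionaries (my shortcut, not part of the session above) and only prints the resulting column names:

```python
import pandas as pd

# Same toy data as above, built directly from dictionaries
df_A = pd.DataFrame({"times": [1, 2, 3], "spots": [44, 55, 66]})
df_B = pd.DataFrame({"times": [1, 2, 3], "spots": [77, 88, 99]})

# No suffixes argument: pandas falls back to '_x' (left) and '_y' (right)
df_default = pd.merge(df_A, df_B, how="inner", on=["times"])
print(list(df_default.columns))  # ['times', 'spots_x', 'spots_y']
```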
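Now, it looks from your question as though you don't actually want the spots column from the first dataframe to be renamed. This can be achieved in the same way, except that instead of suffixes=['_A','_B'] you use suffixes=['','_B']. This effectively sets the suffix for the left dataframe's columns to the empty string, so the name stays unchanged: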
>>> df_merged = pd.merge(df_A,df_B,how='inner',on=['times'],suffixes=['','_B'])
>>> df_merged
times spots spots_B
0 1 44 77
1 2 55 88
2 3 66 99
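Voilà! I hope this helps. If I have misunderstood and you are actually looking for an interpolation between A and B, let me know and I will edit this answer.
*EDIT 1*
Given your last comment, I think this is what you are trying to achieve. Below I show how to extend the merge with suffixes, and then fill in the NaNs in spots_B using the 'time' interpolation method.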
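As a possible alternative (a sketch, not something the question requires), you could rename B's column before merging so that no suffixes are needed at all. The name df_B_renamed is just an illustrative variable:

```python
import pandas as pd

df_A = pd.DataFrame({"times": [1, 2, 3], "spots": [44, 55, 66]})
df_B = pd.DataFrame({"times": [1, 2, 3], "spots": [77, 88, 99]})

# Rename B's 'spots' up front so the column names no longer clash
df_B_renamed = df_B.rename(columns={"spots": "spots_B"})

# Left merge keeps every row of df_A and pulls in spots_B where 'times' matches
df_merged = df_A.merge(df_B_renamed, on="times", how="left")
print(df_merged)
```

Whether this reads better than suffixes=['','_B'] is a matter of taste; the rename makes the target column name explicit, while suffixes handles any number of clashing columns at once.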
# Start by creating our datetimes to set for the times column
>>> import datetime
>>> times_A = []
>>> times_B = []
>>> for i in range(1,4):
...     times_A.append(datetime.datetime(year=2011,month=5,day=i))
...
>>> for i in range(1,6,2):
...     times_B.append(datetime.datetime(year=2011,month=5,day=i))
...
# times_A: May 1st, 2011 - May 3rd, 2011
>>> times_A
[datetime.datetime(2011, 5, 1, 0, 0), datetime.datetime(2011, 5, 2, 0, 0), datetime.datetime(2011, 5, 3, 0, 0)]
# times_B: May 1st 2011, May 3rd 2011, May 5th 2011
>>> times_B
[datetime.datetime(2011, 5, 1, 0, 0), datetime.datetime(2011, 5, 3, 0, 0), datetime.datetime(2011, 5, 5, 0, 0)]
# So now times_B is missing May 2nd, and has an extra time, May 5th.
>>> df_A['times'] = times_A
>>> df_B['times'] = times_B
>>> df_A['spots'] = [44,55,66]
>>> df_B['spots'] = [44,66,88]
>>> df_A
times spots
0 2011-05-01 00:00:00 44
1 2011-05-02 00:00:00 55
2 2011-05-03 00:00:00 66
>>> df_B
times spots
0 2011-05-01 00:00:00 44
1 2011-05-03 00:00:00 66
2 2011-05-05 00:00:00 88
# Now it appears you only care about the times in df_A - so
# left merge df_A with df_B (include all times from df_A and
# try to merge with df_B or NaN). Below the date May 5th was dropped.
>>> df_merged = pd.merge(df_A,df_B,how='left',on=['times'],suffixes=['','_B'])
>>> df_merged
times spots spots_B
0 2011-05-01 00:00:00 44 44
1 2011-05-02 00:00:00 55 NaN
2 2011-05-03 00:00:00 66 66
# Here is the important part:
# Since it appears that your data is going to be a time series
# you will need to set your dataframe index to be the times column.
>>> df_merged = df_merged.set_index(['times'])
>>> df_merged
spots spots_B
times
2011-05-01 44 44
2011-05-02 55 NaN
2011-05-03 66 66
# With the times as index we can use the appropriate
# interpolation method for best results
>>> df_merged['spots_B'] = df_merged['spots_B'].interpolate(method='time')
>>> df_merged
spots spots_B
times
2011-05-01 44 44
2011-05-02 55 55
2011-05-03 66 66
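The default behaviour of interpolate() on a Series is to assume that every row is equally spaced. If your times are not equally spaced, you will need to re-index the dataframe with a time series index. Once the index is a time series, you can pass the method='time' argument to interpolate().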
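To make the difference concrete, here is a small sketch with deliberately uneven dates (the values 10 and 50 and the dates are made up for illustration). The gap to fill sits one day after the first point and three days before the last:

```python
import pandas as pd

# Unevenly spaced timestamps: May 1st, May 2nd, May 5th
idx = pd.to_datetime(["2011-05-01", "2011-05-02", "2011-05-05"])
s = pd.Series([10.0, None, 50.0], index=idx)

# Default (linear): ignores the index and treats the rows as equally spaced
linear = s.interpolate()

# method='time': weights the fill by the elapsed time between index values
by_time = s.interpolate(method="time")

print(linear.iloc[1])   # halfway between 10 and 50
print(by_time.iloc[1])  # one quarter of the way from 10 to 50
```

With the default method the missing value lands at the midpoint, whereas method='time' places it a quarter of the way along, because May 2nd is one day into a four-day span.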