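I have two DataFrames, df1 and df2, with the same column names, both indexed by timestamp. I want to combine them so that rows sharing an index are merged, with the values stored in df2 taking preference. That sentence is poorly worded, so please see the example below.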
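df3 is what I am trying to achieve. Its index contains every timestamp that appears in df1 or df2. For each common index, wherever df2 is not NaN we use df2's value; otherwise the value stored in df1 is kept.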
>>> df1
TimeStamp  A_Output  B_Output  C_Output
00:00:00         20        15         5
00:00:06         20       NaN         3
00:00:15         15         6       NaN
00:00:20         20       NaN         5
00:00:30         25        14        10
>>> df2
TimeStamp  A_Output  B_Output  C_Output
00:00:00         15         5         8
00:00:04         16       NaN       NaN
00:00:06         17       NaN       NaN
00:00:15        NaN       NaN         2
00:00:18         19       NaN       NaN
00:00:21         14       NaN       NaN
00:00:26         32       NaN         5
>>> df3
TimeStamp  A_Output  B_Output  C_Output
00:00:00         15         5         8
00:00:04         16       NaN       NaN
00:00:06         17       NaN         3
00:00:15         15         6         2
00:00:18         19       NaN       NaN
00:00:20         20       NaN         5
00:00:21         14       NaN       NaN
00:00:26         32       NaN         5
00:00:30         25        14        10
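Please see the example above for clarity. I really can't find a way to do this. For reference, each DataFrame has around 90 columns and 100k+ rows.
Answer 0 (score: 2)
Try combine_first: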
df3 = df2.combine_first(df1)
print(df3)
           A_Output  B_Output  C_Output
TimeStamp
00:00:00       15.0       5.0       8.0
00:00:04       16.0       NaN       NaN
00:00:06       17.0       NaN       3.0
00:00:15       15.0       6.0       2.0
00:00:18       19.0       NaN       NaN
00:00:20       20.0       NaN       5.0
00:00:21       14.0       NaN       NaN
00:00:26       32.0       NaN       5.0
00:00:30       25.0      14.0      10.0
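For anyone who wants to reproduce this, here is a minimal self-contained sketch. It assumes TimeStamp is the index of both frames and stores the timestamps as plain strings for brevity (the real data probably uses a datetime or timedelta index). The names cols, union and manual are only illustrative; manual is a hand-rolled equivalent included to show what combine_first does under these assumptions, not how pandas implements it.

```python
import numpy as np
import pandas as pd

cols = ["A_Output", "B_Output", "C_Output"]

# Sample data from the question; timestamps kept as strings (an assumption).
df1 = pd.DataFrame(
    [[20, 15, 5],
     [20, np.nan, 3],
     [15, 6, np.nan],
     [20, np.nan, 5],
     [25, 14, 10]],
    index=pd.Index(
        ["00:00:00", "00:00:06", "00:00:15", "00:00:20", "00:00:30"],
        name="TimeStamp",
    ),
    columns=cols,
    dtype=float,
)

df2 = pd.DataFrame(
    [[15, 5, 8],
     [16, np.nan, np.nan],
     [17, np.nan, np.nan],
     [np.nan, np.nan, 2],
     [19, np.nan, np.nan],
     [14, np.nan, np.nan],
     [32, np.nan, 5]],
    index=pd.Index(
        ["00:00:00", "00:00:04", "00:00:06", "00:00:15",
         "00:00:18", "00:00:21", "00:00:26"],
        name="TimeStamp",
    ),
    columns=cols,
    dtype=float,
)

# df2 takes preference; NaN cells in df2 are filled from df1,
# and the result is indexed by the union of both indexes.
df3 = df2.combine_first(df1)

# Hand-rolled equivalent for illustration: align both frames on the
# union of the indexes, then fill df2's gaps with df1's values.
union = df2.index.union(df1.index)
manual = df2.reindex(union).fillna(df1.reindex(union))

print(df3)
```

Note that the caller is the frame whose values win, so df2.combine_first(df1) and df1.combine_first(df2) give different results wherever both frames hold a non-NaN value.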