Pandas multiple merges create multiple duplicate columns

Time: 2020-08-14 15:20:44

Tags: python pandas dataframe merge duplicates

My goal is to merge 4 Excel worksheets into 1 based on shared host name, serial number, and category. I am using the pandas merge function shown below.


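The problem is that every worksheet has an "IP Address" column, and most of the IPs are the same across sheets. For some reason, the merged DataFrame ends up with 4 columns under 2 duplicated names: "IP Address_x", "IP Address_x", "IP Address_y", "IP Address_y".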
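As a small illustration of where the `_x` and `_y` names come from: `pd.merge` appends its default suffixes only when a non-key column name exists on both sides of a single merge. The toy frames `a`, `b`, and `c` below are hypothetical stand-ins for the worksheets, with a single key column for brevity. Chaining a fourth frame would collide with the plain "IP Address" column and produce a second `_x`/`_y` pair, which matches the duplicated names described above; that last step is left out here because recent pandas releases may raise an error instead of emitting duplicate names.

```python
import pandas as pd

# Hypothetical stand-ins for three of the worksheets, reduced to one key column
a = pd.DataFrame({"Host Name": ["SwitchA"], "IP Address": ["1.1.1.1"]})
b = pd.DataFrame({"Host Name": ["SwitchA"], "IP Address": ["1.1.1.1"]})
c = pd.DataFrame({"Host Name": ["SwitchA"], "IP Address": ["1.1.0.1"]})

# First merge: "IP Address" exists on both sides, so both copies get a suffix
ab = a.merge(b, on="Host Name", how="outer")
two_way_columns = list(ab.columns)

# Second merge: the left side no longer has a plain "IP Address" column,
# so the incoming one keeps its name unchanged
abc = ab.merge(c, on="Host Name", how="outer")
three_way_columns = list(abc.columns)

print(two_way_columns)
print(three_way_columns)
```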

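I would like to combine these 4 columns into 1, but I can't because they share the same names. I have not renamed them manually, because there are about 30 DataFrame columns and doing so would be tedious.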

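Is there a way to merge them so that:

  1. If the IPs are the same, they are merged into one
  2. If the IPs differ, the leftmost one, the first "IP Address_x", is used
  3. If a column is missing, only the first "IP Address_x" is used, provided its IP is not empty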
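One possible reading of the three rules above is "take the first non-empty IP in each row, scanning the columns from left to right". A minimal sketch of that reading follows. It assumes the IP columns already have unique names; the frame `merged` and its column names are hypothetical and are not taken from the real workbook.

```python
import pandas as pd

# Hypothetical merged result in which every IP column has a unique name
merged = pd.DataFrame({
    "Host Name": ["SwitchA", "SwitchB", "SwitchC"],
    "Inventory IP Address": ["1.1.1.1", None, None],
    "Software IP Address": ["1.1.0.1", "1.1.1.2", None],
    "Coverage IP Address": ["1.1.1.1", "1.1.1.2", "1.1.1.3"],
})

# Collect the per-sheet IP columns in left-to-right order
ip_columns = [c for c in merged.columns if c.endswith("IP Address")]

# bfill(axis=1) copies values leftwards across each row, so the first column
# ends up holding the first non-null IP of that row
merged["IP Address"] = merged[ip_columns].bfill(axis=1).iloc[:, 0]
merged = merged.drop(columns=ip_columns)

print(merged)
```

When the IPs differ (SwitchA here), the leftmost value wins, which is how rule 2 is interpreted in this sketch.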

Here are samples of the worksheets. I have many more columns, for example: Name, URL, Site Name, City...

InventoryDf


HardwareDf


SoftwareDf


CoverageDf

import pandas as pd
from functools import partial, reduce

# Read each worksheet of the workbook into its own DataFrame
InventoryDf = pd.read_excel("Inventory.xlsx", sheet_name='Inventory')
SoftwareDf = pd.read_excel("Inventory.xlsx", sheet_name='Software')
HardwareDf = pd.read_excel("Inventory.xlsx", sheet_name='Hardware')
CoverageDf = pd.read_excel("Inventory.xlsx", sheet_name='Coverage')
data_frames = [InventoryDf, SoftwareDf, HardwareDf, CoverageDf]
# Outer-merge the frames pairwise, left to right, on the shared key columns
merge_on_keys = partial(pd.merge, on=['Priority','Category','Product Family','Host Name','Serial Number'], how='outer')
merge = reduce(merge_on_keys, data_frames)

Expected result (the IP addresses are merged even though SwitchA's IP addresses differ)

+-----------+---------------+------------+----------+----------+
| Host Name | Serial Number | IP Address | Priority | Category |
+-----------+---------------+------------+----------+----------+
| SwitchA   | 1230          | 1.1.1.1    | 1        | Switch   |
+-----------+---------------+------------+----------+----------+
| SwitchA   | 1231          | 1.1.1.1    | 1        | Switch   |
+-----------+---------------+------------+----------+----------+
| SwitchB   | 1240          | 1.1.1.2    | 2        | Switch   |
+-----------+---------------+------------+----------+----------+

Excerpt of the original results. Note the missing redundant IP Address_x column.

+-----------+---------------+------------+----------+----------+
| Host Name | Serial Number | IP Address | Priority | Category |
+-----------+---------------+------------+----------+----------+
| SwitchA   | 1230          | 1.1.0.1    | 1        | Switch   |
+-----------+---------------+------------+----------+----------+
| SwitchD   | 1250          | 1.2.2.2    | 1        | Switch   |
+-----------+---------------+------------+----------+----------+
| SwitchE   | 1260          | 1.3.3.3    | 2        | Switch   |
+-----------+---------------+------------+----------+----------+

1 Answer:

Answer 0: (score: 1)

You have started with a sophisticated technique using functools. Add inspect to the mix to get the variable names.

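  1. Iterate over your list of DataFrames. Capture each name and rename its IP Address column
  2. In the merged DataFrame, rename the leftmost IP Address column
  3. fillna() from the other IP Address columns, then drop them
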
import inspect
import functools

import pandas as pd

def retrieve_name(var):
    # Names in the caller's scope that are bound to this exact object
    callers_local_vars = inspect.currentframe().f_back.f_locals.items()
    return [var_name for var_name, var_val in callers_local_vars if var_val is var]

data_frames = [InventoryDf, SoftwareDf, HardwareDf, CoverageDf]
names = []
for df in data_frames:
    # The loop variable "df" is bound to the same object, so skip it
    n = [v for v in retrieve_name(df) if v != "df"][0].replace("Df", "")
    names.append(n)
    # Prefix only the IP Address column with the frame's name
    df.columns = [f"{n} {c}" if c == "IP Address" else c for c in df.columns]
# merge_on_keys = functools.partial(pd.merge, on=['Priority','Category','Product Family','Host Name','Serial Number'], how='outer')
merge_on_keys = functools.partial(pd.merge, on=['Priority','Category','Host Name','Serial Number'], how='outer')

merge = functools.reduce(merge_on_keys, data_frames)

# take the leftmost IP Address column and rename it to "IP Address", fillna() from all subsequent columns
# then drop them
merge.rename(columns={f"{names[0]} IP Address":"IP Address"}, inplace=True)
for n in names[1:]:
    # assign the result back: fillna(inplace=True) on a .loc selection may not update merge
    merge["IP Address"] = merge["IP Address"].fillna(merge[f"{n} IP Address"])
    merge.drop(columns=f"{n} IP Address", inplace=True)
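
If looking up variable names with inspect feels fragile, the same steps can be sketched with an explicit dict that maps a label to each frame. This is a variation and not the method given in the answer above. The frames, the labels, and the shortened key list are hypothetical toy data so that the example runs without the Excel file.

```python
import functools

import pandas as pd

# Shortened key list for the toy data (assumption; the real sheets use more keys)
keys = ["Host Name", "Serial Number"]

# Hypothetical stand-ins for the worksheets, keyed by an explicit label
frames = {
    "Inventory": pd.DataFrame({"Host Name": ["SwitchA", "SwitchB"],
                               "Serial Number": [1230, 1240],
                               "IP Address": ["1.1.1.1", None]}),
    "Software": pd.DataFrame({"Host Name": ["SwitchA", "SwitchB"],
                              "Serial Number": [1230, 1240],
                              "IP Address": ["1.1.0.1", "1.1.1.2"]}),
    "Coverage": pd.DataFrame({"Host Name": ["SwitchC"],
                              "Serial Number": [1250],
                              "IP Address": ["1.1.1.3"]}),
}

# Give every IP column a unique name before merging, so no suffixes are needed
renamed = [df.rename(columns={"IP Address": f"{name} IP Address"})
           for name, df in frames.items()]

merge_on_keys = functools.partial(pd.merge, on=keys, how="outer")
merged = functools.reduce(merge_on_keys, renamed)

# Keep the leftmost IP column, fill its gaps from the others, then drop them
names = list(frames)
merged = merged.rename(columns={f"{names[0]} IP Address": "IP Address"})
for name in names[1:]:
    other = f"{name} IP Address"
    merged["IP Address"] = merged["IP Address"].fillna(merged[other])
    merged = merged.drop(columns=other)

print(merged)
```

Using rename() on copies also leaves the original frames untouched, whereas assigning to df.columns changes them in place.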