Question

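I am using pandas for this task. I have imported two Excel files:

File A contains company names and some other columns.
File B contains company names, the industry each company belongs to (represented by a number from 1 to 14), and some other columns I don't need.

I want to compare the two files, find the matching companies, look up the corresponding industry number, and then add a new column to file A that shows each company's industry by number.

So far, I have extracted the two columns I need from each file and put them into lists. To keep the values associated, I then put them into dictionaries. After that I use a for loop with a nested for loop to look for matches, but I don't know how to proceed from there. There is a further problem: the companies are listed in a similar, but not identical, way in the two files. So if a sequence of more than four characters in the names matches, I want to accept it as a match.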

com = A["comany"].tolist()
indu = A['industry'].tolist()

sponsor = B["sponsor"].tolist()
event = B["Event"].tolist()

dicA = dict(zip(com, indu))
dicB = dict(zip(sponsor, event))

import re
for spnsr in dicB:
    for company, industry in dicA.items():
        m = re.search(spnsr, company)
        if m:
            m = m.group()
            print(m, industry)
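
The "more than four matching characters" rule described above could be sketched with the standard-library `difflib` module instead of a regular expression. This is only an illustration under stated assumptions: the frames `A` and `B`, their column names, the sample rows, and the helper names `longest_common_run` and `lookup_industry` are all hypothetical, and the threshold of 5 is one reading of "more than four".

```python
from difflib import SequenceMatcher

import pandas as pd


def longest_common_run(a: str, b: str) -> int:
    """Length of the longest contiguous run of characters shared by a and b."""
    a, b = a.lower(), b.lower()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def lookup_industry(name, names_b, industries_b, min_run=5):
    """Industry of the best candidate in B, or None if no shared run is long enough."""
    best_len, best_industry = 0, None
    for other, industry in zip(names_b, industries_b):
        run = longest_common_run(name, other)
        if run > best_len:  # on ties, the earlier candidate is kept
            best_len, best_industry = run, industry
    return best_industry if best_len >= min_run else None


# Hypothetical stand-ins for the two Excel sheets.
A = pd.DataFrame({"company": ["Acme Corporation", "Globex Inc", "Initech"]})
B = pd.DataFrame({"company": ["ACME Corp.", "Globex International", "Umbrella"],
                  "industry": [3, 7, 12]})

# New column in A; companies without a sufficiently long shared run stay empty.
A["industry"] = [lookup_industry(n, B["company"], B["industry"])
                 for n in A["company"]]
print(A)
```

Every name in A is compared with every name in B, so this is only practical for modest file sizes. Generic fragments such as "ation" or "company" can also produce false matches, so the threshold probably needs tuning against the real data.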

Answer 1

import pandas as pd

# Read both Excel files
file1 = pd.read_excel("file1.xlsx", na_values=['NA'])
file2 = pd.read_excel("file2.xlsx", na_values=['NA'])

df2 = file1
df1 = file2

# Rows of the second file whose 'samecolname' value also appears in the first file
res = df1[df1['samecolname'].isin(df2['samecolname'].unique())]

# Rows of the first file whose 'samecolname' value also appears in the second file
res2 = df2[df2['samecolname'].isin(df1['samecolname'].unique())]

# isin() keeps the matching rows; negate the mask with ~ to get the non-matching rows instead
res.to_excel('matches-in-second.xlsx', index=False)
res2.to_excel('matches-in-first.xlsx', index=False)
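
The filtering above finds the matching rows but does not add the industry column the question asks for. When the names match exactly, one possible way to do that is `Series.map` with a lookup built from the second file. The frame names, column names, and sample rows below are assumptions made for illustration, not taken from the real files.

```python
import pandas as pd

# Hypothetical stand-ins for the frames returned by pd.read_excel(...).
file_a = pd.DataFrame({"company": ["Acme", "Globex", "Initech"]})
file_b = pd.DataFrame({"company": ["Globex", "Acme", "Umbrella"],
                       "industry": [7, 3, 12]})

# Build a name -> industry lookup from file B.
# drop_duplicates keeps the first row for a repeated company name,
# which map() needs because the lookup index must be unique.
lookup = file_b.drop_duplicates("company").set_index("company")["industry"]

# Map each company in file A to its industry; unmatched names become NaN.
file_a["industry"] = file_a["company"].map(lookup)
print(file_a)
```

Because unmatched companies become NaN, the new column is stored as floats; `astype("Int64")` would convert it to pandas' nullable integer type if whole numbers are preferred.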

Trying to find matches between two DataFrames and generate the corresponding value for each match
