Question

Hi everyone, I'm new to Python. I have two dataframes. One contains drug descriptions, like this:

df1.head(5)

PID  Drug_Admin_Description
1       sodium chloride 0.9% SOLN
2       Nimodipine 30 mg oral
3       Livothirine 20 mg oral
4       Livo tab 112
5       Omega-3 Fatty Acids

The other table contains only drug names, like this:

df2.head(5)

Drug_Name 

Sodium chloride 0.5% SOLN
omega-3 Fatty Acids
gentamicin 40 mg/ml soln
amoxilin 123
abcd 12654

Is there a way to extract only the drugs that are present in both df1 and df2? Example output:

new_column

Sodium chloride
omega-3

I tried using regular expressions in Python but couldn't figure out how to apply them. Thanks in advance.

Answer 1

One possibility is to use get_close_matches from the difflib library.

import pandas as pd
import difflib

drug_description = ["sodium chloride 0.9% SOLN","Nimodipine 30 mg oral",
                    "Livothirine 20 mg oral", "Livo tab 112",
                    "Omega-3 Fatty Acids"]

df1 = pd.DataFrame({"Drug_Admin_Description":drug_description})


drug_name = ["Sodium chloride 0.5% SOLN", "omega-3 Fatty Acids",
            "gentamicin 40 mg/ml soln", "amoxilin 123", "abcd 12654"]

df2 = pd.DataFrame({"Drug_Name":drug_name})
# The above code is to create the dataframe with the information you provided



match_list = []  # drug names from df2 that closely match a description in df1

for drug in df1["Drug_Admin_Description"]:
    # n=1 returns at most the single best match
    match_test = difflib.get_close_matches(drug, drug_name, n=1)
    # An empty list means no candidate reached the default cutoff of 0.6 (60% similarity)
    if match_test:
        match_list.append(match_test[0])  # take the only item in that list

df3 = pd.DataFrame({"new_column": match_list})  # make a dataframe of the matches

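Here is a link to the documentation for get_close_matches. You can pass the cutoff argument (a float between 0 and 1, default 0.6) to set the minimum similarity a candidate needs to count as a match. https://docs.python.org/2/library/difflib.html#difflib.get_close_matches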
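
To illustrate the effect of cutoff, here is a minimal sketch that reuses the sample data from the question. The helper name best_matches is hypothetical, and 0.9 is just an arbitrarily strict threshold picked for the demonstration.

```python
import difflib

descriptions = ["sodium chloride 0.9% SOLN", "Nimodipine 30 mg oral",
                "Livothirine 20 mg oral", "Livo tab 112",
                "Omega-3 Fatty Acids"]
names = ["Sodium chloride 0.5% SOLN", "omega-3 Fatty Acids",
         "gentamicin 40 mg/ml soln", "amoxilin 123", "abcd 12654"]


def best_matches(descriptions, names, cutoff=0.6):
    """Hypothetical helper: best match per description, skipping non-matches."""
    matches = []
    for text in descriptions:
        # n=1 asks for the single best candidate at or above the cutoff
        found = difflib.get_close_matches(text, names, n=1, cutoff=cutoff)
        if found:
            matches.append(found[0])
    return matches


# A strict cutoff keeps only near-identical strings
strict = best_matches(descriptions, names, cutoff=0.9)
print(strict)
```

With this stricter cutoff only the two near-identical names should survive. Note that the matched string from df2 is returned in full, not a shortened drug name.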

Answer 2

One possible solution:

To get the names from a DataFrame column, define the following function:

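def getNames(src, colName):
    # Split once on the first "numeric part" and keep only the text before it
    res = src.str.split(r' [\d.%]+ ?', n=1, expand=True)[[0]]
    # Index by the uppercase name so both frames can be joined case-insensitively
    res.set_index(res[0].str.upper(), inplace=True)
    res.index.name = None
    res.columns = [colName]
    return res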

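I noticed that each drug name can contain a "numeric part" (a space followed by a sequence of digits, dots, or percent characters).

So this function splits each name on this pattern and keeps only the first piece.

Then note that the names differ in upper/lower case, so both name lists need the same uppercase version of each name as their index (that way the two lists can be joined on the index).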
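
As a rough illustration of what the pattern does to individual strings, here is a small sketch using the standard re module instead of pandas. The helper name base_name and the constant NUMERIC_PART are hypothetical and only serve this example.

```python
import re

# Same pattern as in getNames: a space, then digits/dots/percent signs,
# then an optional trailing space
NUMERIC_PART = re.compile(r" [\d.%]+ ?")


def base_name(text):
    """Return the text before the first numeric part, or the whole text."""
    return NUMERIC_PART.split(text, maxsplit=1)[0]


samples = ["sodium chloride 0.9% SOLN", "Livo tab 112",
           "gentamicin 40 mg/ml soln", "Omega-3 Fatty Acids"]

for s in samples:
    name = base_name(s)
    # The uppercase form plays the role of the join key
    print(f"{s!r} -> name {name!r}, key {name.upper()!r}")
```

A name with no numeric part after a space, such as "Omega-3 Fatty Acids", passes through unchanged.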

Then call this function for both source columns:

n1 = getNames(df1.Drug_Admin_Description, 'Name')
n2 = getNames(df2.Drug_Name, 'Name2')

To get the final result, run:

n1.join(n2, how='inner').drop(columns='Name2').reset_index(drop=True)

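Compared with your desired result, there is one difference: Omega-3 Fatty Acids appears in the result in full.

According to the criterion I chose, this name contains no numeric part. The only digit (3) is an integral part of the name, and no digits follow a space anywhere in it. So I think there is nothing to "cut off" in this case.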
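
For readers who want to check the logic of the inner join without pandas, the same idea can be sketched with plain dictionaries and sets. This is only an approximation of the approach above under the same splitting rule; name_key and the variable names are hypothetical.

```python
import re

NUMERIC_PART = re.compile(r" [\d.%]+ ?")


def name_key(text):
    """Return (uppercase key, name) for the part before the numeric portion."""
    name = NUMERIC_PART.split(text, maxsplit=1)[0]
    return name.upper(), name


descriptions = ["sodium chloride 0.9% SOLN", "Nimodipine 30 mg oral",
                "Livothirine 20 mg oral", "Livo tab 112",
                "Omega-3 Fatty Acids"]
drug_names = ["Sodium chloride 0.5% SOLN", "omega-3 Fatty Acids",
              "gentamicin 40 mg/ml soln", "amoxilin 123", "abcd 12654"]

# Uppercase key -> name as spelled in the first list (insertion order is kept)
left = dict(name_key(t) for t in descriptions)
# Uppercase keys present in the second list
right = {name_key(t)[0] for t in drug_names}

# Equivalent of the inner join: keep names whose key occurs on both sides
common = [name for key, name in left.items() if key in right]
print(common)
```

The names come out as spelled in the first list, and "Omega-3 Fatty Acids" is kept whole, matching the difference described above.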
