Match strings from one column and tag them as substrings in another column

Time: 2020-08-07 13:10:30

Tags: python pandas automation nlp jupyter-notebook

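I need Python code that matches the strings in the Target and Effect columns against the text in the Sentence column and replaces each matching substring with a tagged version of itself, as shown below.

Input: untagged substrings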

    Target  Effect  Sentence
0   "hsp90"    "insulin sensitivity"   "treatment of fhrs with doxycycline attenuated the decrease in enos and hsp90 expression but did not improve insulin sensitivity."
1   "hsp90"    "apoptosis"   "radicicol, an inhibitor of hsp90, enhances trail-induced apoptosis in human epithelial ovarian carcinoma cells by promoting activation of apoptosis-related proteins."

Output: tagged substrings

    Target  Effect  Sentence
0   "hsp90"    "insulin sensitivity"   "treatment of fhrs with doxycycline attenuated the decrease in enos and <e1>hsp90</e1> expression but did not improve <e2>insulin sensitivity</e2>."
1   "hsp90"    "apoptosis"    "radicicol, an inhibitor of <e1>hsp90</e1>, enhances trail-induced apoptosis in human epithelial ovarian carcinoma cells by promoting activation of <e2>apoptosis</e2>-related proteins."

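I want to do this with pandas and a DataFrame. Using the example above, how would I accomplish this?

1 Answer:

Answer 0 (score: 0)

This is straightforward with apply(): treat the Target and Effect values in each row as regular expression patterns and substitute them in Sentence with re.sub().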

import re
import pandas as pd

# Sample data as text; columns are separated by two or more spaces.
data = '''    Target  Effect  Sentence
0   hsp90   insulin sensitivity   "treatment of fhrs with doxycycline attenuated the decrease in enos and hsp90 expression but did not improve insulin sensitivity."
1   hsp90    apoptosis   "radicicol, an inhibitor of hsp90, enhances trail-induced apoptosis in human epithelial ovarian carcinoma cells by promoting activation of apoptosis-related proteins."'''
# Drop the leading row index from each line, split on double spaces, and discard empty fields.
a = [[t.strip() for t in re.split("  ",l) if t!=""]  for l in [re.sub("([0-9]+[ ])*(.*)", r"\2", l) for l in data.split("\n")]]
df = pd.DataFrame(a[1:], columns=a[0])

# Tag Target with <e1> first, then Effect with <e2>.
# re.escape makes each value match as literal text rather than as regex syntax.
df["Sentence"] = df.apply(lambda r: re.sub(f"({re.escape(r['Effect'])})", r"<e2>\1</e2>",
                          re.sub(f"({re.escape(r['Target'])})", r"<e1>\1</e1>", r["Sentence"])), axis=1)
print(df.to_string(index=False))


Output

Target               Effect                                                                                                                                                                                            Sentence
 hsp90  insulin sensitivity                                                "treatment of fhrs with doxycycline attenuated the decrease in enos and <e1>hsp90</e1> expression but did not improve <e2>insulin sensitivity</e2>."
 hsp90            apoptosis  "radicicol, an inhibitor of <e1>hsp90</e1>, enhances trail-induced <e2>apoptosis</e2> in human epithelial ovarian carcinoma cells by promoting activation of <e2>apoptosis</e2>-related proteins."
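
Note that in the second row both occurrences of "apoptosis" are tagged, whereas the expected output in the question tags only one of them. The following is a minimal sketch of the same substitution, not the answerer's code: it builds the DataFrame directly instead of parsing text and moves the tagging into a helper. The helper name tag_terms, its count parameter, and the Tagged column are hypothetical names introduced here for illustration. count is passed straight to re.sub, so 0 tags every occurrence and 1 tags only the first.

```python
import re

import pandas as pd


def tag_terms(sentence, target, effect, count=0):
    """Wrap `target` in <e1> tags and `effect` in <e2> tags inside `sentence`.

    `count` is forwarded to re.sub: 0 tags every occurrence, 1 only the first.
    (Hypothetical helper, not part of the original answer.)
    """
    # re.escape keeps characters such as "+" or "(" in a term from being
    # interpreted as regex syntax.
    tagged = re.sub(f"({re.escape(target)})", r"<e1>\1</e1>", sentence, count=count)
    return re.sub(f"({re.escape(effect)})", r"<e2>\1</e2>", tagged, count=count)


df = pd.DataFrame(
    {
        "Target": ["hsp90", "hsp90"],
        "Effect": ["insulin sensitivity", "apoptosis"],
        "Sentence": [
            "treatment of fhrs with doxycycline attenuated the decrease in enos "
            "and hsp90 expression but did not improve insulin sensitivity.",
            "radicicol, an inhibitor of hsp90, enhances trail-induced apoptosis "
            "in human epithelial ovarian carcinoma cells by promoting activation "
            "of apoptosis-related proteins.",
        ],
    }
)

# Build the tagged column row by row; zip avoids the overhead of apply(axis=1).
df["Tagged"] = [
    tag_terms(s, t, e)
    for s, t, e in zip(df["Sentence"], df["Target"], df["Effect"])
]

for tagged_sentence in df["Tagged"]:
    print(tagged_sentence)
```

Calling tag_terms(sentence, target, effect, count=1) limits tagging to the first occurrence of each term. Tagging a specific later occurrence, as the question's second expected row appears to want, would need extra logic that this sketch does not attempt.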