Extract the text between tags that contain a given word using Python

Time: 2019-01-24 16:29:15

Tags: python xml nlp

I have some text extracted from an XML document, and I am trying to pull out the text of every tag that contains a certain word.

For example:

search('adverse')

should return the text of every tag that contains the word "adverse":

Out: 
  [
    "<item>The most common adverse reactions reported in subjects receiving coadministered dutasteride and tamsulosin were impotence, decreased libido, breast disorders (including breast enlargement and tenderness), ejaculation disorders, and dizziness.</item>"
  ]

search('clinical')

should return two results, because two tags contain that word.

Out: 
  [
    "<title>6.1 Clinical Trials Experience</title>", 
    "<paragraph id="ID41">The clinical efficacy and safety of coadministered dutasteride and tamsulosin, which are individual components of dutasteride and tamsulosin hydrochloride capsules, have been evaluated in a multicenter, randomized, double-blind, parallel group trial (the Combination with Alpha-Blocker Therapy, or CombAT, trial) </paragraph>"
  ]

What tool should I use for this? Regular expressions? BS4? Any suggestions would be much appreciated.


Sample text:

 </highlight>
 </excerpt>
 <component>
 <section id="ID40">
 <id root="fbc21d1a-2fb2-47b1-ac53-f84ed1428bb4"></id>
 <title>6.1 Clinical Trials Experience</title>
 <text>
 <paragraph id="ID41">The clinical efficacy and safety of coadministered dutasteride and tamsulosin, which are individual components of dutasteride and tamsulosin hydrochloride capsules, have been evaluated in a multicenter, randomized, double-blind, parallel group trial (the Combination with Alpha-Blocker Therapy, or CombAT, trial) </paragraph>
 <list id="ID42" listtype="unordered" stylecode="Disc">
 <item>The most common adverse reactions reported in subjects receiving coadministered dutasteride and tamsulosin were impotence, decreased libido, breast disorders (including breast enlargement and tenderness), ejaculation disorders, and dizziness.</item>

1 answer:

Answer 0: (score: 1)

You can either hard-code it with a regular expression or parse the XML file with a library such as lxml.

The regular expression approach would be:

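import re

your_text = "(...)"  # the XML text to search

def search(word):
    # Match a single leaf element (no nested tags) whose text contains `word`.
    # [^<]* keeps the match inside one element; \1 requires the matching close tag.
    pattern = r"<(\w+)[^>]*>[^<]*{}[^<]*</\1>".format(re.escape(word))
    # IGNORECASE so that "clinical" also finds "Clinical".
    return [m.group(0) for m in re.finditer(pattern, your_text, re.IGNORECASE)]

print(search("safety"))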
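
For the parser route, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml, and it assumes the input is well-formed XML. The sample in the question is a fragment with unmatched tags, so it would first have to be completed or read with a more lenient parser. The SAMPLE string and the search_tree name are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample shortened from the excerpt in the question.
SAMPLE = """<section id="ID40">
  <title>6.1 Clinical Trials Experience</title>
  <text>
    <paragraph id="ID41">The clinical efficacy and safety of coadministered dutasteride and tamsulosin have been evaluated.</paragraph>
    <list id="ID42">
      <item>The most common adverse reactions were impotence and dizziness.</item>
    </list>
  </text>
</section>"""


def search_tree(word, xml_text):
    """Return the serialized elements whose own text contains `word`."""
    root = ET.fromstring(xml_text)
    needle = word.lower()
    hits = []
    for elem in root.iter():
        # elem.text is the text directly inside the element, before any child.
        if elem.text and needle in elem.text.lower():
            # Drop the trailing text so only the element itself is serialized.
            elem.tail = None
            hits.append(ET.tostring(elem, encoding="unicode"))
    return hits


print(search_tree("adverse", SAMPLE))
print(search_tree("clinical", SAMPLE))
```

Unlike the regular expression, this only looks at the text directly inside each element, so a container such as the section element is not reported just because one of its children matches.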