Question

I am trying to extract the "Abstract" substring from the "description" element of an XML RSS feed. Code snippet:

import feedparser

rss = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?rss_guid=1RGmO3jHeXUu8o2CWPinET6JLLik93hwR2IAJ5mU-YzoPeX1-O'
feed = feedparser.parse(rss)

for post in feed.entries:
    print(post.description)

For the first entry, I want the string that is embedded in the description between <p>Abstract<br/> ...... <br/>:

The recent development of electronic logbooks with secure off-device data storage provides a rich resource for research. We present the largest analysis of anaesthetic logbooks to date, with data from 494,235 cases logged by 964 anaesthetists over a 4-year period. Our analysis describes and compares the annual case-load and supervision levels of different grades of anaesthetists across the UK and Republic of Ireland. We calculated the number of cases undertaken per year by grade (median (IQR [range]) core trainees = 388 (252-512 [52-1204]); specialist trainees = 344 (228-480 [52-1144]); and consultants = 328 (204-500 [52-1316]). Overall, the proportion of cases undertaken with direct consultant supervision was 56.7% and 41.6% for core trainees and specialist trainees, respectively. The proportion of supervised cases reduced out-of-hours, for both core trainees (day 93.5%, evening 86.3%, night 78.6%) and specialist trainees (day 81.0%, evening 67.7%, night 56.4%)

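I am not sure how to separate the abstract from everything else in the description. I assume I need a regex search, and I have tried the answers to similar substring questions, but they do not work, possibly because of the HTML tags.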

Many thanks

Answer 1

To parse the PubMed feed results, you can use BeautifulSoup.

from bs4 import BeautifulSoup

# match p(aragraph) tags whose text starts with "Abstract"
def find_abstract(tag):
    return tag.name == 'p' and tag.text.startswith('Abstract')

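# `post` is an entry from feed.entries; name the parser explicitly to avoid the bs4 warning
soup = BeautifulSoup(post.description, 'html.parser')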

# find the abstract (find() returns None when no paragraph matches)
result = soup.find(find_abstract)

# format the text: keep everything but the first line
if result is not None:
    abstract = '\n'.join(result.text.splitlines()[1:]).strip()
    print(abstract)
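
For comparison, the regex route mentioned in the question can be sketched with the standard library alone. This is only a rough sketch: the sample string is a hypothetical stand-in shaped like the <p>Abstract<br/> ...... <br/> fragment described above, not real feed output, and the names SAMPLE_DESCRIPTION, ABSTRACT_RE and extract_abstract are made up for illustration. It assumes the abstract sits in a single <p> element that begins with the word "Abstract".

```python
import html
import re

# Hypothetical sample shaped like the description in the question (not real feed output).
SAMPLE_DESCRIPTION = (
    '<p>Anaesthesia. 2018 Jan 1.</p>'
    '<p>Abstract<br/>The recent development of electronic logbooks '
    'provides a rich resource for research.<br/></p>'
    '<p>PMID: 00000000</p>'
)

# Match a <p> that starts with "Abstract<br/>", then lazily capture up to its closing </p>.
ABSTRACT_RE = re.compile(
    r'<p[^>]*>\s*Abstract\s*<br\s*/?>(.*?)</p>',
    re.IGNORECASE | re.DOTALL,
)


def extract_abstract(description):
    """Return the abstract text, or None if no Abstract paragraph is found."""
    match = ABSTRACT_RE.search(description)
    if match is None:
        return None
    # Turn remaining <br/> tags into newlines, drop any other tags, decode entities.
    text = re.sub(r'<br\s*/?>', '\n', match.group(1))
    text = re.sub(r'<[^>]+>', '', text)
    return html.unescape(text).strip()


print(extract_abstract(SAMPLE_DESCRIPTION))
```

With the feed from the question, the call would presumably be extract_abstract(post.description) inside the loop. A regex like this tends to break as soon as the markup changes (nested paragraphs, different tag casing or attributes), which is one reason an HTML parser such as BeautifulSoup is usually the more robust choice.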

Extracting a substring from an XML element
