I am trying to extract the "Abstract" substring from the "description" element of an XML RSS feed. Code snippet:
import feedparser

rss = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?rss_guid=1RGmO3jHeXUu8o2CWPinET6JLLik93hwR2IAJ5mU-YzoPeX1-O'
feed = feedparser.parse(rss)
# print the raw HTML description of every entry in the feed
for post in feed.entries:
    print(post.description)
The string I want is embedded in the description between <p>Abstract<br/> ...... <br/>. For the first item it is:
The recent development of electronic logbooks with secure off-device data storage provides a rich resource for research. We present the largest analysis of anaesthetic logbooks to date, with data from 494,235 cases logged by 964 anaesthetists over a 4-year period. Our analysis describes and compares the annual case-load and supervision levels of different grades of anaesthetists across the UK and Republic of Ireland. We calculated the number of cases undertaken per year by grade (median (IQR [range]) core trainees = 388 (252-512 [52-1204]); specialist trainees = 344 (228-480 [52-1144]); and consultants = 328 (204-500 [52-1316]). Overall, the proportion of cases undertaken with direct consultant supervision was 56.7% and 41.6% for core trainees and specialist trainees, respectively. The proportion of supervised cases reduced out-of-hours, for both core trainees (day 93.5%, evening 86.3%, night 78.6%) and specialist trainees (day 81.0%, evening 67.7%, night 56.4%)
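I am not sure how to separate the abstract from everything else in the description. I assume I need a regular-expression search, and I have tried the answers to similar substring questions, but they did not work, probably because of the HTML tags.
Many thanks.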
Answer 0 (score: 0)
To parse the PubMed feed results, you can use BeautifulSoup.
from bs4 import BeautifulSoup

# match <p> tags whose text starts with "Abstract"
def find_abstract(tag):
    return tag.name == 'p' and tag.text.startswith('Abstract')

# `post` is one entry from feed.entries; naming the parser avoids a bs4 warning
soup = BeautifulSoup(post.description, 'html.parser')
# find the first matching paragraph, i.e. the abstract
result = soup.find(find_abstract)
# format the text: keep everything but the first line
abstract = '\n'.join(result.text.splitlines()[1:]).strip()
print(abstract)
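Since the question asked about a regular expression, here is a minimal standard-library sketch of that route. It assumes the description really follows the <p>Abstract<br/> ...... <br/> shape quoted in the question. SAMPLE_DESCRIPTION, ABSTRACT_RE and extract_abstract are made-up names for illustration, and the sample markup is invented rather than copied from the feed.

```python
import html
import re

# Invented sample, modelled on the "<p>Abstract<br/> ...... <br/>" shape
# described in the question; the real feed markup may differ.
SAMPLE_DESCRIPTION = (
    "<p>Authors: A. Example</p>"
    "<p>Abstract<br/>\nFirst sentence of the abstract. "
    "Values &gt; 50% were common.<br/></p>"
    "<p>Other metadata</p>"
)

# <p>, the word "Abstract", a <br> variant, then a lazy capture
# that stops at the next <br> variant.
ABSTRACT_RE = re.compile(
    r"<p>\s*Abstract\s*<br\s*/?>(.*?)<br\s*/?>",
    re.IGNORECASE | re.DOTALL,
)


def extract_abstract(description):
    """Return the abstract text, or None if the pattern is not found."""
    match = ABSTRACT_RE.search(description)
    if match is None:
        return None
    # Decode entities such as &gt; and collapse runs of whitespace.
    text = html.unescape(match.group(1))
    return " ".join(text.split())


print(extract_abstract(SAMPLE_DESCRIPTION))
```

In the question's loop this would be called as extract_abstract(post.description). The lazy capture stops at the first <br/> after "Abstract", so an abstract that contains its own line breaks would be cut short. An HTML parser such as BeautifulSoup is more forgiving of that kind of variation.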