Question

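I am trying to scrape some research abstracts from the web, and certain words end up merged together in the output. So far the only fix I have found is something like outputexample.replace("WordMerge",""), which is not consistent enough.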

For example, for the URL given in my code, the first line of output is:

AbstractsPublic AbstractDownload this abstract: English (pdf) | Español (pdf) | Audio Recording (mp3)

I would like to prevent this from happening while keeping as much of the original text and formatting as possible.

 import requests
 import time
 from bs4 import BeautifulSoup
 import re

 urlsummary = 'https://www.pcori.org/research-results/2013/testing-new-ways-schedule-appointments-community-health-centers-help-patients'
 html = requests.get(urlsummary).content
 soup = BeautifulSoup(html, 'lxml')

 abstract = soup.find(class_='pane pane--node').get_text()
 print(abstract)

Answer 1

Just use

.get_text(" ")

From the docs:

You can specify a string to be used to join the bits of text together:
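
As a rough illustration of the separator argument, here is a minimal sketch. The HTML snippet is made up to mimic adjacent tags with no whitespace between them (it is not the real page markup), and html.parser is used only so the example runs without lxml. The strip=True variant is an optional extra, not part of the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: sibling tags with no whitespace between them,
# similar in spirit to the first line of output shown in the question.
html = (
    "<div>"
    "<h2>Abstracts</h2><h3>Public Abstract</h3>"
    "<p>Download this abstract: <a href='#'>English (pdf)</a></p>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")
pane = soup.find("div")

# Default: text pieces are concatenated with no separator, so words merge.
merged = pane.get_text()

# Passing a separator joins the pieces with that string instead.
spaced = pane.get_text(" ")

# Optionally strip each piece first to avoid doubled spaces.
tidy = pane.get_text(" ", strip=True)

print(merged)
print(spaced)
print(tidy)
```

If line breaks matter more than a single line of text, passing "\n" as the separator instead of " " keeps each piece of text on its own line.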
