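I'm scraping research abstracts from the web to build a dataset. When I try this for PCORI abstracts, I get the content I need, but when the text contains bullet points, the bullet points do not come through correctly.
I'm a beginner, and although I did look for other code, surprisingly I couldn't find anyone else with the same problem. The example I'm using is: https://www.pcori.org/research-results/2013/testing-new-ways-schedule-appointments-community-health-centers-help-patients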
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
out = []
urlsummary = 'https://www.pcori.org/research-results/2013/testing-new-ways-schedule-appointments-community-health-centers-help-patients'
html = requests.get(urlsummary).content
soup = BeautifulSoup(html, 'lxml')
abstract = soup.find(class_='pane pane--node').get_text(" ")
about = abstract.split('What was the research about?')[1]
project_status = soup.find(class_='field field-name-field-award-status').get_text(" ")
data = {'About': about, 'abstract': abstract, 'Status': project_status}
out.append(data)
df = pd.DataFrame(out)
print(df)
df.to_excel('PCORI_Results.xlsx')
Answer 0 (score: 1)
The problem is that whenever you use .get_text(" "), you strip out the HTML. In this case, that removes the <ul> and <li> tags that create the bullet points.
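One possible workaround, as a minimal sketch rather than a definitive fix: write a plain-text marker into each <li> before calling .get_text(" "), so the list structure survives once the tags are gone. The helper name text_with_bullets, the "- " marker, and the sample_html snippet are all assumptions made for illustration; the real PCORI page markup may differ, and the sketch uses the built-in 'html.parser' only so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the PCORI page so the sketch runs offline.
sample_html = """
<div class="pane pane--node">
  <p>What was the research about?</p>
  <ul>
    <li>First point</li>
    <li>Second point</li>
  </ul>
</div>
"""


def text_with_bullets(node, marker="- "):
    """Return node's text with each <li> on its own line, prefixed by marker.

    Note: this modifies the parsed tree in place by adding a text marker
    at the start of every <li>.
    """
    for li in node.find_all("li"):
        # A newline plus marker keeps each list item on a separate line.
        li.insert(0, "\n" + marker)
    text = node.get_text(" ")
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


soup = BeautifulSoup(sample_html, "html.parser")
abstract = text_with_bullets(soup.find(class_="pane pane--node"))
print(abstract)
```

In the question's script, the same helper could replace the .get_text(" ") call that builds abstract. The later split on 'What was the research about?' should still work, since the marker only adds text inside the list items.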