Web scraping research abstracts: preserving bullet points in the output

Time: 2019-07-15 20:58:26

Tags: python pandas text web-scraping beautifulsoup

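I am scraping various research abstracts from the web to build a dataset. When I try to do this for PCORI abstracts, I can get the content I need, but when the text contains bullet points:

  1. the bullet points are missing from my output, and
  2. the indentation associated with the bullet points is missing too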

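I am a novice, and although I did look for other code, I was surprised not to find anyone else with the same problem. The example I am working with is: https://www.pcori.org/research-results/2013/testing-new-ways-schedule-appointments-community-health-centers-help-patients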

import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

out = []

# The URL must stay on one line; a line break inside the quotes is a syntax error.
urlsummary = 'https://www.pcori.org/research-results/2013/testing-new-ways-schedule-appointments-community-health-centers-help-patients'
html = requests.get(urlsummary).content
soup = BeautifulSoup(html, 'lxml')

# Extract the text of the abstract pane and of the award-status field.
abstract = soup.find(class_='pane pane--node').get_text(" ")
about = abstract.split('What was the research about?')[1]
project_status = soup.find(class_='field field-name-field-award-status').get_text(" ")

# Collect the fields into a one-row DataFrame and write it to Excel.
data = {'About': about, 'abstract': abstract, 'Status': project_status}
out.append(data)
df = pd.DataFrame(out)
print(df)

df.to_excel('PCORI_Results.xlsx')

1 answer:

Answer 0 (score: 1)

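The problem is that .get_text(" ") returns only the text and discards all of the HTML markup. In this case that includes the <ul> and <li> tags that create the bullet points, so neither the bullets nor their indentation reach your output.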
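
One possible workaround, as a minimal sketch rather than a definitive fix, is to build the text block by block instead of calling .get_text(" ") on the whole pane, and to prefix each <li> with a bullet indented by its nesting depth. The helper name text_with_bullets, the BLOCK_TAGS list, the "- " bullet, and the sample markup are all assumptions made for illustration; the real PCORI page may be structured differently (for example, <p> tags nested inside <li> would need extra handling, and text outside the listed tags is dropped).

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the abstract pane (not copied from the real page).
SAMPLE_HTML = """
<div class="pane pane--node">
  <h3>What was the research about?</h3>
  <p>Intro paragraph:</p>
  <ul>
    <li>First point</li>
    <li>Second <strong>point</strong>
      <ul><li>Nested point</li></ul>
    </li>
  </ul>
</div>
"""

# Hypothetical choice of tags that should each start a new output line.
BLOCK_TAGS = ["p", "li", "h2", "h3"]


def text_with_bullets(node, bullet="- ", indent="  "):
    """Return the text of `node`, one line per block, keeping list bullets."""
    lines = []
    for block in node.find_all(BLOCK_TAGS):
        # Keep only strings whose nearest block-level ancestor is this block,
        # so the text of a nested <li> is not repeated in its parent <li>.
        parts = [s for s in block.find_all(string=True)
                 if s.find_parent(BLOCK_TAGS) is block]
        text = " ".join("".join(parts).split())  # collapse runs of whitespace
        if not text:
            continue
        if block.name == "li":
            # Depth 0 for a top-level list, 1 for a list nested inside it, ...
            depth = len(block.find_parents(["ul", "ol"])) - 1
            text = indent * depth + bullet + text
        lines.append(text)
    return "\n".join(lines)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(text_with_bullets(soup.find(class_="pane")))
```

In the original script, the returned string could be used in place of the .get_text(" ") result before splitting on 'What was the research about?'.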