I want to grab all of the relevant text sections from certain web pages and parse them into a structured format, such as a CSV file, for later use. However, the pages I want to pull information from do not strictly follow the same format, for example:
http://www.cs.bham.ac.uk/research/groupings/machine-learning/
http://www.cs.bham.ac.uk/research/groupings/robotics/
http://www.cs.bham.ac.uk/research/groupings/reasoning/
I have been using BeautifulSoup, which works well for pages that follow a well-defined format, but these particular sites do not.
How can I write code to extract the main body of text from these pages?
Can I extract all of the text and then strip out the irrelevant/common parts?
Or can I somehow select those larger bodies of text, even though they do not appear in a uniform way?
The sites do vary in format, but not so much that I would consider this impossible.
Initially, I had this code for handling the structured pages:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import sqlite3
conn = sqlite3.connect('/Users/tom/PycharmProjects/tmc765/Parsing/MScProject.db')
c = conn.cursor()
### Specify URL
programme_list = ["http://www.cs.bham.ac.uk/internal/programmes/2017/0144",
"http://www.cs.bham.ac.uk/internal/programmes/2017/9502",
"http://www.cs.bham.ac.uk/internal/programmes/2017/452B",
"http://www.cs.bham.ac.uk/internal/programmes/2017/4436",
"http://www.cs.bham.ac.uk/internal/programmes/2017/5914",
"http://www.cs.bham.ac.uk/internal/programmes/2017/9503",
"http://www.cs.bham.ac.uk/internal/programmes/2017/9499",
"http://www.cs.bham.ac.uk/internal/programmes/2017/5571",
"http://www.cs.bham.ac.uk/internal/programmes/2017/5955",
"http://www.cs.bham.ac.uk/internal/programmes/2017/4443",
"http://www.cs.bham.ac.uk/internal/programmes/2017/9509",
"http://www.cs.bham.ac.uk/internal/programmes/2017/5576",
"http://www.cs.bham.ac.uk/internal/programmes/2017/9501",
"http://www.cs.bham.ac.uk/internal/programmes/2017/4754",
"http://www.cs.bham.ac.uk/internal/programmes/2017/5196"]
for programme_page in programme_list:
    # Query page, return html to a variable
    page = urlopen(programme_page)
    soupPage = BeautifulSoup(page, 'html.parser')
    name_box = soupPage.find('h1')
    Programme_Identifier = name_box.text.strip()
    Programme_Award = soupPage.find("td", text="Final Award").find_next_sibling("td").text
    # "Interim Award" is not present on every page, so guard against None
    Interim_Award = soupPage.find("td", text="Interim Award")
    if Interim_Award is not None:
        Interim_Award = Interim_Award.find_next_sibling("td").text
    Programme_Title = soupPage.find("td", text="Programme Title").find_next_sibling("td").text
    School_Department = soupPage.find("td", text="School/Department").find_next_sibling("td").text
    Banner_Code = soupPage.find("td", text="Banner Code").find_next_sibling("td").text
    Programme_Length = soupPage.find("td", text="Length of Programme").find_next_sibling("td").text
    Total_Credits = soupPage.find("td", text="Total Credits").find_next_sibling("td").text
    UCAS_Code = soupPage.find("td", text="UCAS Code").find_next_sibling("td").text
    Awarding_Institution = soupPage.find("td", text="Awarding Institution").find_next_sibling("td").text
    QAA_Benchmarking_Groups = soupPage.find("td", text="QAA Benchmarking Groups").find_next_sibling("td").text

    # SQL code for inserting into database
    with conn:
        c.execute("INSERT INTO Programme_Pages VALUES (?,?,?,?,?,?,?,?,?,?,?,?)",
                  (Programme_Identifier, Programme_Award, Interim_Award, Programme_Title,
                   School_Department, Banner_Code, Programme_Length, Total_Credits,
                   UCAS_Code, Awarding_Institution, QAA_Benchmarking_Groups, programme_page))

    print("Programme Identifier: ", Programme_Identifier)
    print("Programme Award: ", Programme_Award)
    print("Interim Award: ", Interim_Award)
    print("Programme Title: ", Programme_Title)
    print("School/Department: ", School_Department)
    print("Banner Code: ", Banner_Code)
    print("Length of Programme: ", Programme_Length)
    print("Total Credits: ", Total_Credits)
    print("UCAS Code: ", UCAS_Code)
    print("Awarding Institution: ", Awarding_Institution)
    print("QAA Benchmarking Groups: ", QAA_Benchmarking_Groups)
    print("~~~~~~~~~~\n~~~~~~~~~~")

    # Pull the "Educational Aims" block and store each bullet point
    Educational_Aims = soupPage.find('div', {"class": "programme-text-block"})
    Educational_Aims_Title = Educational_Aims.find('h2').text.strip()
    Educational_Aims_List = Educational_Aims.find_all("li")
    print(Educational_Aims_Title)
    for el in Educational_Aims_List:
        text = el.text.strip()
        with conn:
            c.execute("INSERT INTO Programme_Info VALUES (?,?,?,?)",
                      (Programme_Identifier, text, Educational_Aims_Title, programme_page))
        print(text)
However, I have not yet found a way to write a script that extracts the relevant text from the unstructured pages linked above. I am considering pulling out every section tagged as <p> and handling each one as it appears (a rough sketch of that idea follows), but I thought someone might have insight into a simpler approach.
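As a rough sketch of that idea, assuming the content of interest sits in <p> elements (an assumption on my part; these pages may also use <ul> lists):

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.cs.bham.ac.uk/research/groupings/machine-learning/"
soup = BeautifulSoup(urlopen(url), 'html.parser')

# Grab the text of every <p> element on the page. This also picks up
# navigation/footer paragraphs, which would still need filtering out.
for p in soup.find_all('p'):
    text = p.get_text(strip=True)
    if text:  # skip empty paragraphs
        print(text)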
Answer 0 (score: 0)
The structure actually looks fine to me :-) These pages are made up of <ul> lists, or a bunch of <p> elements, or a mixture of the two, and you can inspect all of them. Finally, to print all the paragraphs in the correct order, you have to walk the elements from top to bottom. For more information, have a look at this question: [Access next sibling]
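A minimal sketch of that top-to-bottom walk, assuming the main content follows the page's <h1> heading (the starting point and tag names are my assumptions, following the [Access next sibling] idea):

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.cs.bham.ac.uk/research/groupings/reasoning/"
soup = BeautifulSoup(urlopen(url), 'html.parser')

# Walk the siblings that follow the <h1> in document order, so mixed
# <p> and <ul> content comes out in the order it appears on the page.
for sibling in soup.find('h1').find_next_siblings():
    if sibling.name == 'p':
        print(sibling.get_text(strip=True))
    elif sibling.name == 'ul':
        for li in sibling.find_all('li'):
            print('-', li.get_text(strip=True))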
Answer 1 (score: 0)
It depends on what kind of information you want to extract. In my example I extract the title, the text, and the list of staff (if present). You can add further parsing rules to extract more information:
urls = ['http://www.cs.bham.ac.uk/research/groupings/machine-learning/',
        'http://www.cs.bham.ac.uk/research/groupings/robotics/',
        'http://www.cs.bham.ac.uk/research/groupings/reasoning/']

from bs4 import BeautifulSoup
import requests

for url in urls:
    soup = BeautifulSoup(requests.get(url).text, 'lxml')

    # parse title
    title = soup.select_one('h1.title').text

    # parse academic staff (if any): take the <li> items of the last
    # <ul> that follows an <h2>, then clear them so they are not
    # repeated in the main text below
    staff_list = []
    if soup.select('h2 ~ ul'):
        for li in soup.select('h2 ~ ul')[-1].find_all('li'):
            staff_list.append(li.text)
            li.clear()
        soup.select('h2')[-1].clear()

    # parse the text: everything that follows the <nav> element
    text = ''
    for t in soup.select('nav ~ *'):
        text += t.text.strip() + '\n'

    print(title)
    print(text)
    print('Staff list = ', staff_list)
    print('-' * 80)
This prints (abridged):
Intelligent Robotics Lab
Welcome to the Intelligent Robotics Lab in the School of Computer Science at the University of Birmingham. ...
Staff list = []
--------------------------------------------------------------------------------
Reasoning
Overview
This grouping includes research on various forms of reasoning, including theorem proving and uncertain reasoning, with particular application to mathematical knowledge management, mathematical document recognition, computer algebra, natural language processing, and multi-attribute and multi-agent decision-making. The research is relevant both to understanding how human reasoning works and to designing useful practical tools...
Staff list = ['John Barnden', 'Richard Dearden', 'Antoni Diller', 'Manfred Kerber', 'Mark Lee', 'Xudong Luo', 'Alan Sexton', 'Volker Sorge']
--------------------------------------------------------------------------------
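Since the original goal was a CSV file for later use, the extracted fields can then be written out with the standard csv module. A minimal sketch (the file name, column headings, and example row are my own, not part of the answer):

import csv

# Illustrative row; in practice, append one (title, text, staff) tuple
# per page inside the scraping loop above instead of hard-coding it.
rows = [('Reasoning',
         'This grouping includes research on various forms of reasoning...',
         'John Barnden; Richard Dearden')]

with open('research_groupings.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'text', 'staff'])
    writer.writerows(rows)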