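Below, I have written a script that parses the HTML and then splits a list entry into variables to assign to the correct columns. How can I make the program extract all of the past data from the website, so that I don't have to assign a new set of 'a, b, y' variables every time? Also, I would appreciate help splitting out the date (if you look at the parsed HTML text, you will see a date before the description; I only used y as a placeholder for testing).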
import bs4
import requests
from pprint import pprint
import sqlite3
def get_elems_from_document(document):
    pass
res = requests.get('http://www.sharkresearchcommittee.com/pacific_coast_shark_news.htm')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
news = [p.text.strip() for p in soup.select('h1 ~ p') if p.find('font')]
a, b = str(news[0]).split(" \xa0 — \xa0 ")
y = 'test'
c = sqlite3.connect('shark.db')
try:
    # If the table already exists, executing CREATE TABLE raises an OperationalError
    # because it tries to create a table that already exists. (Adding a new column later
    # can be a problem: essentially you have to delete the db file and recreate it with the new column.)
    c.execute('''CREATE TABLE mytable (
                    Location STRING,
                    Date STRING,
                    Description STRING )''')
except sqlite3.OperationalError:  # i.e. the table already exists
    pass
c.execute('''INSERT INTO mytable(Location,Date,Description) VALUES(?,?,?)''',
          (a, y, b))
c.commit()
c.close()
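Answer 0 (score: 2)
You can use the re module to parse the news. This code creates a temporary :memory: sqlite database and prints the location, date and a short excerpt of every article: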
import re
import bs4
import sqlite3
import requests
import textwrap
res = requests.get('http://www.sharkresearchcommittee.com/pacific_coast_shark_news.htm')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
news = [p.text.strip() for p in soup.select('h1 ~ p') if p.find('font')]
with sqlite3.connect(":memory:") as conn:
    c = conn.cursor()
    c.execute('''CREATE TABLE
                    mytable (Location STRING,
                             Date STRING,
                             Description STRING)''')
    for n in news:
        groups = re.match(r'(.*?)\W+—?\W+On\W+(.*?\d{4})\W*(.*)', n, flags=re.DOTALL)
        if not groups:
            continue
        place, date, article = groups[1], groups[2], groups[3]
        c.execute('''INSERT INTO mytable(Location, Date, Description) VALUES(?,?,?)''',
                  (place, date, article))
    conn.commit()
    # print the data back:
    c.execute('''SELECT * FROM mytable''')
    for place, date, article in c:
        print('{} -- {}'.format(place, date))
        print(textwrap.shorten(article, width=70))
        print('*' * 80)
Prints:
Shell Beach -- August 1, 2018
Kristen Sanchez was paddling an outrigger with two companions [...]
********************************************************************************
Monterey Bay -- August 1, 2018
Eric Keener was spearfishing for California Halibut, [...]
********************************************************************************
Pacifica -- July 27, 2018
Kris Lopez was surfing with 4 unidentified surfers at Pacifica [...]
********************************************************************************
Santa Monica -- July 26, 2018
Tim O’Leary was surfing between lifeguard towers 29 and 30 in [...]
********************************************************************************
Ventura -- July 23, 2018
Victor Malfonado was surfing at Rincon Beach 3 miles East of [...]
********************************************************************************
... and so on.
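To see how the regular expression splits one item without hitting the website, here is a minimal offline sketch. The sample string is made up to resemble the page text (location, a dash surrounded by non-breaking spaces, then "On <date>"), and the helper name parse_item is hypothetical. It also converts the date to ISO format, which is an optional extra and assumes the "Month day, year" format seen in the output above and an English locale for month names.

```python
import re
from datetime import datetime

# Same pattern as in the answer: location, optional dash, "On",
# a date ending in a 4-digit year, then the rest of the article.
NEWS_RE = re.compile(r'(.*?)\W+—?\W+On\W+(.*?\d{4})\W*(.*)', flags=re.DOTALL)


def parse_item(text):
    """Return (location, date, description), or None if the text does not match."""
    m = NEWS_RE.match(text)
    if not m:
        return None
    place, date_text, article = m.group(1), m.group(2), m.group(3)
    try:
        # Assumption: dates look like "August 1, 2018".
        date = datetime.strptime(date_text, '%B %d, %Y').date().isoformat()
    except ValueError:
        date = date_text  # keep the raw text if the format is unexpected
    return place, date, article


# Hypothetical sample modelled on the page's text layout.
sample = 'Shell Beach \xa0 \u2014 \xa0 On August 1, 2018 Kristen Sanchez was paddling an outrigger.'
print(parse_item(sample))
```

Storing the date as YYYY-MM-DD text would let SQLite sort rows chronologically with ORDER BY Date, which the original "August 1, 2018" strings do not allow.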