Parse HTML and store all of the old data in SQLite3

Date: 2018-08-14 04:19:43

Tags: python sqlite beautifulsoup

Below, I have created a script that parses the HTML and then splits the list into variables to assign to the correct columns. How can I make the program pull all of the site's past data, so that I don't have to assign new 'a, b, y' variables each time? Also, could someone help me split out the date? (If you look at the parsed HTML text, there is a date before each description; I only used y as a placeholder for testing.)

import bs4
import requests
from pprint import pprint
import sqlite3


def get_elems_from_document(document):
    pass

res = requests.get('http://www.sharkresearchcommittee.com/pacific_coast_shark_news.htm')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')

news = [p.text.strip() for p in soup.select('h1 ~ p') if p.find('font')]
a, b = str(news[0]).split(" \xa0 —  \xa0 ")
y = 'test'
c = sqlite3.connect('shark.db')
try:
    # If the table already exists, CREATE TABLE raises an OperationalError
    # because it tries to create a duplicate table. (Adding a new column
    # later is also awkward: essentially you have to delete the db file and
    # recreate it with the new column.)
    c.execute('''CREATE TABLE mytable (
                    Location        STRING,
                    Date            STRING,
                    Description     STRING)''')
except sqlite3.OperationalError:  # i.e. the table already exists
    pass

c.execute('''INSERT INTO mytable(Location,Date,Description) VALUES(?,?,?)''',
          (a, y, b))
c.commit()
c.close()
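(As an aside: instead of catching `sqlite3.OperationalError`, SQLite's `CREATE TABLE IF NOT EXISTS` clause skips creation silently when the table is already present. A minimal sketch, using an in-memory database for demonstration:)

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # throwaway in-memory DB for demonstration

# IF NOT EXISTS makes repeated CREATE TABLE statements a no-op
# instead of raising OperationalError.
conn.execute('''CREATE TABLE IF NOT EXISTS mytable (
                    Location    TEXT,
                    Date        TEXT,
                    Description TEXT)''')
conn.execute('''CREATE TABLE IF NOT EXISTS mytable (
                    Location    TEXT,
                    Date        TEXT,
                    Description TEXT)''')  # second call succeeds silently

tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
print(tables)  # ['mytable']
```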

1 Answer:

Answer 0 (score: 2):

You can use re to parse the news items. This code creates a temporary :memory: SQLite database and prints the location, date, and a short excerpt of every article:

import re
import bs4
import sqlite3
import requests
import textwrap

res = requests.get('http://www.sharkresearchcommittee.com/pacific_coast_shark_news.htm')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')

news = [p.text.strip() for p in soup.select('h1 ~ p') if p.find('font')]

with sqlite3.connect(":memory:") as conn:
    c = conn.cursor()
    c.execute('''CREATE TABLE
                    mytable (Location        STRING,
                             Date            STRING,
                             Description     STRING)''')

    for n in news:
        groups = re.match(r'(.*?)\W+—?\W+On\W+(.*?\d{4})\W*(.*)', n, flags=re.DOTALL)
        if not groups:
            continue
        place, date, article = groups[1], groups[2], groups[3]

        c.execute('''INSERT INTO mytable(Location, Date, Description) VALUES(?,?,?)''',
            (place, date, article))
    conn.commit()

    # print the data back:
    c.execute('''SELECT * FROM mytable''')

    for place, date, article in c:
        print('{} -- {}'.format(place, date))
        print(textwrap.shorten(article, width=70))
        print('*' * 80)

This prints:

Shell Beach -- August 1, 2018
Kristen Sanchez was paddling an outrigger with two companions [...]
********************************************************************************
Monterey Bay -- August 1, 2018
Eric Keener was spearfishing for California Halibut, [...]
********************************************************************************
Pacifica -- July 27, 2018
Kris Lopez was surfing with 4 unidentified surfers at Pacifica [...]
********************************************************************************
Santa Monica -- July 26, 2018
Tim O’Leary was surfing between lifeguard towers 29 and 30 in [...]
********************************************************************************
Ventura -- July 23, 2018
Victor Malfonado was surfing at Rincon Beach 3 miles East of [...]
********************************************************************************

... and so on.
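To see how the regex splits each item, here is a minimal demonstration on a single sample string (the `\xa0` separator and wording are modeled on the question's output, not taken from the live site):

```python
import re

# Sample news item shaped like the scraped paragraphs (an assumption
# based on the question's split on " \xa0 —  \xa0 " and the output above).
sample = ("Shell Beach \xa0 —  \xa0 On August 1, 2018 Kristen Sanchez "
          "was paddling an outrigger with two companions.")

# Group 1: everything before the dash (the location).
# Group 2: the text after "On" up to and including a 4-digit year (the date).
# Group 3: the rest of the paragraph (the description).
m = re.match(r'(.*?)\W+—?\W+On\W+(.*?\d{4})\W*(.*)', sample, flags=re.DOTALL)
place, date, article = m[1], m[2], m[3]
print(place)    # Shell Beach
print(date)     # August 1, 2018
print(article)  # Kristen Sanchez was paddling an outrigger with two companions.
```

Because `re.match` returns `None` when the pattern does not fit, the answer's `if not groups: continue` guard skips any paragraph that lacks the "Location — On <date> <description>" shape.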