Parse HTML and store all of the old data in SQLite3

Date: 2018-08-14 04:19:43

Tags: python sqlite beautifulsoup

Below, I have created a script that parses the HTML and then splits the list into variables to assign to the correct columns. How can I make the program pull all of the site's past data, so that I don't have to assign new 'a, b, y' variables each time? Also, could someone help me split out the date? (If you look at the parsed HTML text, there is a date before each description; I only used y as a placeholder for testing.)

import bs4
import requests
from pprint import pprint
import sqlite3


def get_elems_from_document(document):
    pass

res = requests.get('http://www.sharkresearchcommittee.com/pacific_coast_shark_news.htm')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')

news = [p.text.strip() for p in soup.select('h1 ~ p') if p.find('font')]
a, b = str(news[0]).split(" \xa0 —  \xa0 ")
y = 'test'
c = sqlite3.connect('shark.db')
try:
    # If the table already exists, CREATE TABLE raises an OperationalError
    # because it tries to create a duplicate table. (Adding a new column
    # later is also awkward: essentially you have to delete the db file and
    # recreate it with the new column.)
    c.execute('''CREATE TABLE mytable (
                    Location        STRING,
                    Date            STRING,
                    Description     STRING)''')
except sqlite3.OperationalError:  # i.e. the table already exists
    pass

c.execute('''INSERT INTO mytable(Location,Date,Description) VALUES(?,?,?)''',
          (a, y, b))
c.commit()
c.close()
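(As an aside: instead of catching `sqlite3.OperationalError`, SQLite's `CREATE TABLE IF NOT EXISTS` clause skips creation silently when the table is already present. A minimal sketch, using an in-memory database for demonstration:)

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # throwaway in-memory DB for demonstration

# IF NOT EXISTS makes repeated CREATE TABLE statements a no-op
# instead of raising OperationalError.
conn.execute('''CREATE TABLE IF NOT EXISTS mytable (
                    Location    TEXT,
                    Date        TEXT,
                    Description TEXT)''')
conn.execute('''CREATE TABLE IF NOT EXISTS mytable (
                    Location    TEXT,
                    Date        TEXT,
                    Description TEXT)''')  # second call succeeds silently

tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
print(tables)  # ['mytable']
```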

1 Answer:

Answer 0 (score: 2):

You can use re to parse the news items. This code creates a temporary :memory: SQLite database and prints the location, date, and a short excerpt of every article:

import re
import bs4
import sqlite3
import requests
import textwrap

res = requests.get('http://www.sharkresearchcommittee.com/pacific_coast_shark_news.htm')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')

news = [p.text.strip() for p in soup.select('h1 ~ p') if p.find('font')]

with sqlite3.connect(":memory:") as conn:
    c = conn.cursor()
    c.execute('''CREATE TABLE
                    mytable (Location        STRING,
                             Date            STRING,
                             Description     STRING)''')

    for n in news:
        groups = re.match(r'(.*?)\W+—?\W+On\W+(.*?\d{4})\W*(.*)', n, flags=re.DOTALL)
        if not groups:
            continue
        place, date, article = groups[1], groups[2], groups[3]

        c.execute('''INSERT INTO mytable(Location, Date, Description) VALUES(?,?,?)''',
            (place, date, article))
    conn.commit()

    # print the data back:
    c.execute('''SELECT * FROM mytable''')

    for place, date, article in c:
        print('{} -- {}'.format(place, date))
        print(textwrap.shorten(article, width=70))
        print('*' * 80)

This prints:

Shell Beach -- August 1, 2018
Kristen Sanchez was paddling an outrigger with two companions [...]
********************************************************************************
Monterey Bay -- August 1, 2018
Eric Keener was spearfishing for California Halibut, [...]
********************************************************************************
Pacifica -- July 27, 2018
Kris Lopez was surfing with 4 unidentified surfers at Pacifica [...]
********************************************************************************
Santa Monica -- July 26, 2018
Tim O’Leary was surfing between lifeguard towers 29 and 30 in [...]
********************************************************************************
Ventura -- July 23, 2018
Victor Malfonado was surfing at Rincon Beach 3 miles East of [...]
********************************************************************************

... and so on.
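To see how the regex splits each item, here is a minimal demonstration on a single sample string (the `\xa0` separator and wording are modeled on the question's output, not taken from the live site):

```python
import re

# Sample news item shaped like the scraped paragraphs (an assumption
# based on the question's split on " \xa0 —  \xa0 " and the output above).
sample = ("Shell Beach \xa0 —  \xa0 On August 1, 2018 Kristen Sanchez "
          "was paddling an outrigger with two companions.")

# Group 1: everything before the dash (the location).
# Group 2: the text after "On" up to and including a 4-digit year (the date).
# Group 3: the rest of the paragraph (the description).
m = re.match(r'(.*?)\W+—?\W+On\W+(.*?\d{4})\W*(.*)', sample, flags=re.DOTALL)
place, date, article = m[1], m[2], m[3]
print(place)    # Shell Beach
print(date)     # August 1, 2018
print(article)  # Kristen Sanchez was paddling an outrigger with two companions.
```

Because `re.match` returns `None` when the pattern does not fit, the answer's `if not groups: continue` guard skips any paragraph that lacks the "Location — On <date> <description>" shape.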