Scraping text from a web page with Python 2.7

Date: 2016-10-21 21:33:39

Tags: python web-scraping

I am trying to scrape data from this website: Death Row Information

I can't scrape the last statements of all the executed offenders in the list, because each last statement sits on a separate HTML page. Those URLs are built like this: http://www.tdcj.state.tx.us/death_row/dr_info/[lastname][firstname]last.html. I can't figure out how to scrape the last statements from those pages and put them into an SQLite database.

All the other information (except the "Offender Information", which I don't need) is already in my database.

Can anyone give me a pointer on how to get this done in Python?

Thanks

Edit 2: I've gotten a little further:

import sqlite3
import csv
import re
import urllib2
from urllib2 import Request, urlopen, URLError
from BeautifulSoup import BeautifulSoup
import requests
import string
URLS = []
Lastwords = {}

conn = sqlite3.connect('prison.sqlite')
conn.text_factory = str
cur = conn.cursor()

# Make a fresh table using execute()
cur.execute("DROP TABLE IF EXISTS prison")
cur.execute("CREATE TABLE Prison ( link1 text, link2 text,Execution text, LastName text, Firstname text, TDCJNumber text, Age integer, date text, race text, county text)")
conn.commit()


csvfile = open("prisonfile.csv","rb")
creader = csv.reader(csvfile, delimiter = ",")
for t in creader:
    cur.execute('INSERT INTO Prison VALUES (?,?,?,?,?,?,?,?,?,?)', t)

for column in cur.execute("SELECT LastName, Firstname FROM prison"):
    lastname = column[0].lower()
    firstname = column[1].lower()
    name = lastname+firstname
    CleanName = name.translate(None, ",.!-@'#$")
    CleanName2 = CleanName.replace(" ", "")
    Url = "http://www.tdcj.state.tx.us/death_row/dr_info/"
    Link = Url+CleanName2+"last.html"
    URLS.append(Link)
for URL in URLS:
    try:
        page = urllib2.urlopen(URL)
    except urllib2.HTTPError, e:
        # Pages with no last statement 404; skip those
        if e.code == 404:
            continue
        raise
    soup = BeautifulSoup(page.read())
    statements = soup.findAll ('p',{ "class" : "Last Statement:" })
    print statements

csvfile.close()
conn.commit()
conn.close()

I know the code is messy; I'll clean it up once everything works. But there is one problem: I'm trying to grab all the statements with soup.findAll, but I can't seem to get the class right. The relevant part of the page source looks like this:

<p class="text_bold">Last Statement:</p>
<p>I don't have anything to say, you can proceed Warden Jones.</p>

However, my program outputs:

[]
[]
[]

... What exactly is going wrong?

1 Answer:

Answer 0 (score: 0)

I'm not going to write code that solves the problem for you, but I'll give you a simple plan for doing it yourself:

You know that each last statement is located at a URL of the form:

http://www.tdcj.state.tx.us/death_row/dr_info/[lastname][firstname]last.html

You say you already have all the other information, which presumably includes the list of executed inmates. So you should generate a list of names in your Python code; that will let you build the URL for every page you need to visit.
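
For instance, here's a minimal sketch that reuses the name-cleaning from your Edit 2 code (it assumes your sqlite3 cursor cur is still open; the exact punctuation set to strip is a guess):

names = []
for lastname, firstname in cur.execute("SELECT LastName, Firstname FROM prison"):
    # Lowercase, then drop punctuation and spaces, as in your Edit 2 code
    name = (lastname + firstname).lower()
    names.append(name.translate(None, " ,.!-@'#$"))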

Then create a for loop that iterates over each of those URLs, built in the format I posted above.

In the body of that for loop, write code to read the page and grab the last statement. The last statement takes the same format on every page, so you can parse out the part you need:

<p class="text_bold">Last Statement:</p>
<p>D.J., Laurie, Dr. Wheat, about all I can say is goodbye, and for all the rest of you, although you don&rsquo;t forgive me for my transgressions, I forgive yours against me. I am ready to begin my journey and that&rsquo;s all I have to say.</p>
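
Note that "Last Statement:" is the text of the label paragraph, not its class; the class is text_bold, and the statement itself sits in the next <p>, which has no class at all. That is why findAll('p', {'class': 'Last Statement:'}) in your Edit 2 code returns empty lists. A minimal sketch with BeautifulSoup 3 (the version your code imports), which finds the label and steps to the next <p> sibling:

from BeautifulSoup import BeautifulSoup

def extract_last_statement(html):
    soup = BeautifulSoup(html)
    for label in soup.findAll('p', {'class': 'text_bold'}):
        # The label paragraph reads exactly 'Last Statement:'
        if label.string and label.string.strip() == 'Last Statement:':
            stmt = label.findNextSibling('p')
            if stmt is not None:
                # Join the paragraph's text nodes into one string
                return ' '.join(stmt.findAll(text=True)).strip()
    return None

On the snippet above this returns the full "D.J., Laurie, Dr. Wheat, ..." paragraph (some statements span several <p> tags, so you may need to keep collecting siblings).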

Once you have the list of last statements, you can push them to SQL.

So your code would look something like this:

import urllib2

# Make a list of names ('Last1First1', 'Last2First2', 'Last3First3', ...)
names = []  # some call to your database, as sketched above

# Build the URL of each inmate's last-words page from the names list
# ('URL...Last1First1last.html', 'URL...Last2First2last.html', ...)
BASE_URL = 'http://www.tdcj.state.tx.us/death_row/dr_info/'
URLS = [BASE_URL + name + 'last.html' for name in names]

# Create a dictionary to hold all the last words:
LastWords = {}

# Iterate over each individual page
for eachURL in URLS:
    # Some prisoners had no last words, so those URLs will 404; skip them.
    try:
        response = urllib2.urlopen(eachURL)
    except urllib2.HTTPError, e:
        if e.code == 404:
            continue
        raise
    html = response.read()

    # Code to parse the response, hunting specifically for the
    # block I mentioned above. Once you have the last words as a
    # string, save them to the dictionary:
    LastWords['LastFirst'] = "LastFirst's last words."

# Now LastWords is a dictionary with all the last words!
# Write some more code to push the content of LastWords
# to your SQL database.
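
That final push could be a minimal sketch like this (the LastStatements table and its two-column layout are just an assumption; you could equally UPDATE a new column on your existing Prison table):

import sqlite3

conn = sqlite3.connect('prison.sqlite')
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS LastStatements (name TEXT, statement TEXT)")
# dict.items() yields (name, statement) tuples, one per pair of ? placeholders
cur.executemany("INSERT INTO LastStatements VALUES (?, ?)", LastWords.items())
conn.commit()
conn.close()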