Unable to download certain text from a website

Asked: 2016-10-29 10:43:15

Tags: python sqlite

I am trying to download all the last statements from the Death Row website. The basic outline is this:

1. Information from the site is imported into a sqlite database, prison.sqlite.
2. Based on the names in the table, I generate a unique URL for each name to fetch their last statement.
3. The program checks every generated URL; if the URL is OK, it looks for the last statement. That statement is downloaded into the database prison.sqlite (the same database).

Here is my code:

import sqlite3
import csv
import re
import urllib2
from urllib2 import Request, urlopen, URLError
from BeautifulSoup import BeautifulSoup
import requests
import string
URLS = ["http://www.tdcj.state.tx.us/death_row/dr_info/hernandezramontorreslast.html",
    "http://www.tdcj.state.tx.us/death_row/dr_info/garciafrankmlast.html",
    "http://www.tdcj.state.tx.us/death_row/dr_info/martinezdavidlast999173.html",
    "http://www.tdcj.state.tx.us/death_row/dr_info/moselydaroycelast.html",
    "http://www.tdcj.state.tx.us/death_row/dr_info/martinezdavidlast999288.html",
    "http://www.tdcj.state.tx.us/death_row/dr_info/hernandezadophlast.html",
    "http://www.tdcj.state.tx.us/death_row/dr_info/carterrobertanthonylast.html",
    "http://www.tdcj.state.tx.us/death_row/dr_info/livingstoncharleslast.html",
    "http://www.tdcj.state.tx.us/death_row/dr_info/gentrykennethlast.html",
    "http://www.tdcj.state.tx.us/death_row/dr_info/gentrykennethlast.html",
    "http://www.tdcj.state.tx.us/death_row/dr_info/wilkersonrichardlast.html",
    "http://www.tdcj.state.tx.us/death_row/dr_info/hererraleonellast.html",]

conn = sqlite3.connect('prison.sqlite')
conn.text_factory = str
cur = conn.cursor()

cur.execute("DROP TABLE IF EXISTS prison")
cur.execute("CREATE TABLE Prison ( Execution text, link1 text, Statements text, LastName text, Firstname text, TDCJNumber text, Age integer, date text, race text, county text)")
conn.commit()


csvfile = open("prisonfile.csv","rb")
creader = csv.reader(csvfile, delimiter = ",")
for t in creader:
    cur.execute('INSERT INTO  Prison VALUES (?,?,?,?,?,?,?,?,?,?)', t, )

for column in cur.execute("SELECT LastName, Firstname FROM prison"):
    lastname = column[0]
    firstname = column[1]
    name = lastname+firstname
    CleanName = name.translate(None, ",.!-@'#$" "")
    CleanName = CleanName.replace(" ", "")
    CleanName = CleanName.replace("III","")
    CleanName = re.sub("Sr","",CleanName)
    CleanName = re.sub("Jr","",CleanName)
    CleanName = CleanName.lower()
    Baseurl = "http://www.tdcj.state.tx.us/death_row/dr_info/"
    Link = Baseurl+CleanName+"last.html"
    URLS.append(Link)


    for Link in URLS:
        try:
            r = requests.get(Link)
            r.raise_for_status()
            print "URL OK", Link
            document = urllib2.urlopen(Link)
            html = document.read()
            soup = BeautifulSoup(html)
            Statement = soup.find(text="Last Statement:").findNext('p').contents[0]
            print Statement
            continue
        except requests.exceptions.HTTPError as err:
            print err
            print "Offender has made no statement.", Link
            #cur.execute("INSERT OR IGNORE INTO prison(Statements) VALUES(?)"), (Statement, )

csvfile.close()
conn.commit()
conn.close()

When I run the program, I get:

C:\python>prison.py
URL OK http://www.tdcj.state.tx.us/death_row/dr_info/hernandezramontorreslast.html
Can you hear me? Did I ever tell you, you have dad's eyes? I've noticed that in the last couple of days. I'm sorry for putting you through all this. Tell everyone I love them. It was good seeing the kids. I love them all; tell mom, everybody. I am very sorry for all of the pain. Tell Brenda I love her. To everybody back on the row, I know you're going through a lot over there. Keep fighting, don't give up everybody.
URL OK http://www.tdcj.state.tx.us/death_row/dr_info/garciafrankmlast.html
Thank you, Jesus Christ. Thank you for your blessing. You are above the president. And know it is you, Jesus Christ, that is performing this miracle in my life. Hallelujah, Holy, Holy, Holy. For this reason I was born and raised. Thank you for this, my God is a God of Salvation. Only through you, Jesus Christ, people will see that you're still on the throne. Hallelujah, Holy, Holy, Holy. I invoke Your name. Thank you, Yahweh, thank you Jesus Christ. Hallelujah, Amen. Thank you, Warden.
URL OK http://www.tdcj.state.tx.us/death_row/dr_info/martinezdavidlast999173.html
Traceback (most recent call last):
  File "C:\python\prison.py", line 60, in <module>
    Statement = soup.find(text="Last Statement:").findNext('p').contents[0]
AttributeError: 'NoneType' object has no attribute 'findNext'

The first two statements come through fine, but after that the program crashes. Looking at the page source of the URL where the error occurs, I see (only the relevant part shown):

<div class="return_to_div"></div>
<h1>Offender Information</h1>
<h2>Last Statement</h2>
<p class="text_bold">Date of Execution:</p>
<p> February 4, 2009</p>
<p class="text_bold"> Offender:</p>
<p> Martinez, David</p>
<p class="text_bold"> Last Statement:</p>
<p> Yes, nothing I can say can change the past. I am asking for forgiveness. Saying sorry is not going to change anything. I hope one day you can find peace. I am sorry for all of the pain that I have caused you for all those years. There is nothing else I can say, that can help you. Mija, I love you. Sis, Cynthia, and Sandy, keep on going and it will be O.K. I am sorry to put you through this as well. I can't change the past. I hope you find peace and know that I love you. I am sorry. I am sorry and I can't change it.  </p>

What could be causing this? Do I have to change something in this line?:

Statement = soup.find(text="Last Statement:").findNext('p').contents[0]
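
Comparing this page with the first two that did work, my guess is that the leading space inside <p class="text_bold"> Last Statement:</p> means the exact text "Last Statement:" is not present on this page, so find() returns None. Below is a small self-contained sketch I put together against the snippet above; the regex-based lookup is only my assumed workaround, not something I have verified against the live site:

import re
from BeautifulSoup import BeautifulSoup

# The relevant part of the page source that crashes my script.
snippet = """
<h2>Last Statement</h2>
<p class="text_bold"> Last Statement:</p>
<p> Yes, nothing I can say can change the past. I am asking for forgiveness.</p>
"""

soup = BeautifulSoup(snippet)

# Exact match, as in my current code: on this page I assume it finds nothing.
print soup.find(text="Last Statement:")

# Regex match: search() still finds the label despite the surrounding whitespace.
label = soup.find(text=re.compile("Last Statement:"))
if label is not None:
    print label.findNext('p').contents[0]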

Feel free to share improvements to my code. For now I just want to get everything working; making it more robust can come later.

For those wondering about the hard-coded list of URLs: it exists because of some inconsistencies on the Death Row website. Sometimes the URL does not follow the [lastname][firstname]last.html pattern, so for now I add those manually.
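
If it helps to see how I might fold those exceptions into the URL generation instead of keeping a separate hard-coded list, here is a rough sketch. The OVERRIDES dict and its keys are just placeholders I made up; I key on the TDCJ number only because the same cleaned name can occur twice (the two David Martinez pages):

BASEURL = "http://www.tdcj.state.tx.us/death_row/dr_info/"

# Hypothetical overrides for pages that break the [lastname][firstname]last.html pattern.
OVERRIDES = {
    "999173": "martinezdavidlast999173.html",
    "999288": "martinezdavidlast999288.html",
}

def statement_url(clean_name, tdcj_number):
    # Prefer a hand-entered exception; otherwise build the usual pattern.
    page = OVERRIDES.get(tdcj_number, clean_name + "last.html")
    return BASEURL + page

print statement_url("garciafrankm", "000000")   # placeholder number, falls back to the pattern
print statement_url("martinezdavid", "999173")  # one of the exceptions from my list above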
