I am trying to scrape data from this website: Death Row Information

I can't scrape the last statements of all the executed offenders in the list, because each last statement sits on a separate HTML page. The URLs are constructed like this: http://www.tdcj.state.tx.us/death_row/dr_info/[lastname][firstname].html. I can't figure out how to scrape the last statements from those pages and put them into an SQLite database.

All the other information (except the "Offender Information", which I don't need) is already in my database.

Can anyone give me a pointer on how to do this in Python?

Thanks

Edit 2: I have gotten a little further:
import sqlite3
import csv
import re
import urllib2
from urllib2 import Request, urlopen, URLError
from BeautifulSoup import BeautifulSoup
import requests
import string

URLS = []
Lastwords = {}

conn = sqlite3.connect('prison.sqlite')
conn.text_factory = str
cur = conn.cursor()

# Make some fresh tables using executescript()
cur.execute("DROP TABLE IF EXISTS prison")
cur.execute("CREATE TABLE Prison (link1 text, link2 text, Execution text, LastName text, Firstname text, TDCJNumber text, Age integer, date text, race text, county text)")
conn.commit()

csvfile = open("prisonfile.csv", "rb")
creader = csv.reader(csvfile, delimiter=",")

for t in creader:
    cur.execute('INSERT INTO Prison VALUES (?,?,?,?,?,?,?,?,?,?)', t)

for column in cur.execute("SELECT LastName, Firstname FROM prison"):
    lastname = column[0].lower()
    firstname = column[1].lower()
    name = lastname + firstname
    CleanName = name.translate(None, ",.!-@'#$")
    CleanName2 = CleanName.replace(" ", "")
    Url = "http://www.tdcj.state.tx.us/death_row/dr_info/"
    Link = Url + CleanName2 + "last.html"
    URLS.append(Link)

for URL in URLS:
    try:
        page = urllib2.urlopen(URL)
    except URLError, e:
        if e.code == 404:
            continue
    soup = BeautifulSoup(page.read())
    statements = soup.findAll('p', {"class": "Last Statement:"})
    print statements

csvfile.close()
conn.commit()
conn.close()
I know the code is messy; I'll clean it up once everything works. But there is one problem: I am trying to get all the statements using soup.findAll, but I can't seem to get the class right. The relevant part of the page source looks like this:
<p class="text_bold">Last Statement:</p>
<p>I don't have anything to say, you can proceed Warden Jones.</p>
However, my program outputs:
[]
[]
[]
... What exactly is going wrong?
Answer 0 (score: 0)
I won't write the code that solves the problem for you, but here is a simple plan for how to do it yourself:

You know that every last statement is located at a URL of the form:
http://www.tdcj.state.tx.us/death_row/dr_info/[lastname][firstname]last.html
You say you already have all the other information. That presumably includes the list of executed prisoners, so you should generate a list of names in your Python code. This will let you generate the URL for every page you need to visit, as in the sketch below.
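For these first steps, a rough sketch, assuming the prison.sqlite database and the LastName/Firstname columns from the question (the exact punctuation stripping needed to match the site's URL scheme is a guess):

import sqlite3

conn = sqlite3.connect('prison.sqlite')
conn.text_factory = str  # plain byte strings, as in the question
cur = conn.cursor()

# Build 'lastfirst' style names, stripped of the punctuation and
# spaces that the site leaves out of its URLs.
names = []
for last, first in cur.execute("SELECT LastName, Firstname FROM prison"):
    names.append((last + first).lower().translate(None, " ,.!-@'#$"))

URLS = ["http://www.tdcj.state.tx.us/death_row/dr_info/%slast.html" % n
        for n in names]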
Then create a for loop that iterates over each URL, using the format I posted above.

In the body of that for loop, write code to read the page and grab the last statement. The last statement appears in the same format on every page, so you can parse out exactly the part you need:
<p class="text_bold">Last Statement:</p>
<p>D.J., Laurie, Dr. Wheat, about all I can say is goodbye, and for all the rest of you, although you don’t forgive me for my transgressions, I forgive yours against me. I am ready to begin my journey and that’s all I have to say.</p>
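Using the BeautifulSoup 3 import from the question, one way to capture that pattern is to find the "Last Statement:" label text and then take the next <p> element. A rough sketch (the helper name is just for illustration):

import re
from BeautifulSoup import BeautifulSoup

def extract_last_statement(html):
    soup = BeautifulSoup(html)
    # 'Last Statement:' is the text inside the label <p>, not a CSS
    # class, so search for the text node first ...
    label = soup.find(text=re.compile('Last Statement:'))
    if label is None:
        return None
    # ... then the statement itself is the next <p> element.
    statement = label.findNext('p')
    if statement is None:
        return None
    return ''.join(statement.findAll(text=True)).strip()

This is also why the question's findAll returns []: "Last Statement:" is the paragraph's text, not its class; the label's class is text_bold.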
Once you have the list of last statements, you can push them to SQL.

So your code would look something like this:
import urllib2

# Make a list of names ('Last1First1', 'Last2First2', 'Last3First3', ...)
names = []  # some call to your database goes here

# Make a list of URLs to each inmate's last words page
# ('URL...Last1First1last.html', 'URL...Last2First2last.html', ...)
URLS = []  # built from the 'names' list above

# Create a dictionary to hold all the last words:
LastWords = {}

# Iterate over each individual page
for eachURL in URLS:
    response = urllib2.urlopen(eachURL)
    html = response.read()

    # Some prisoners had no last words, so those URLs will 404.
    # Handle those 404s here.

    # Code to parse the response, hunting specifically for the
    # block I mentioned above. Once you have the last words as a
    # string, save them to the dictionary:
    LastWords['LastFirst'] = "LastFirst's last words."

# Now LastWords is a dictionary with all the last words!
# Write some more code to push the content of LastWords
# to your SQL database.
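For that final push, a minimal sketch, assuming the LastWords dictionary from the plan above (the last_statements table is illustrative, not part of the question's schema):

import sqlite3

conn = sqlite3.connect('prison.sqlite')
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS last_statements "
            "(name text, statement text)")
# LastWords maps each 'LastFirst' name to that inmate's statement.
cur.executemany("INSERT INTO last_statements VALUES (?, ?)",
                LastWords.items())
conn.commit()
conn.close()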