Python web scraping tutorial: problem connecting to a MySQL database

Posted: 2013-07-30 10:11:11

Tags: python mysql web connection screen-scraping

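I am learning web scraping from a series of tutorials by Chris Reeves. It is very good material, and you should take a look.

I ran into a problem with the example from tutorial no. 10, in which Chris explains how to connect to a MySQL database. My first problem was that values were not being committed to the table in the database. Then I found in the comments that I was missing conn.commit(), which the video's author had not included in his program. I added that call to my program and it works; it now looks like this: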

from threading import Thread
import urllib
import re
import MySQLdb

conn = MySQLdb.connect(host="127.0.0.1",port=3307,user="root",passwd="root",db="stock_data")

query = "INSERT INTO tutorial (symbol) values('AAPL')"
x = conn.cursor()
x.execute(query)
conn.commit()
row = x.fetchall()

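It connects to my local database and successfully adds AAPL to the symbol column of the tutorial table.

My problem starts with the second part of Chris's tutorial, where you add the multithreaded code that reads 4-letter symbols from an external .txt file and adds everything to the same database.

Now my program looks like this: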

from threading import Thread
import urllib
import re
import MySQLdb

gmap = {}

def th(ur):
    base = "http://finance.yahoo.com/q?s="+ur
    regex = '<span id="yfs_l84_'+ur.lower()+'">(.+?)</span>'
    pattern = re.compile(regex)
    htmltext = urllib.urlopen(base).read()
    results = re.findall(pattern,htmltext)
    try:
        gmap[ur] = results [0]
    except:
        print "got an error"

symbolslist = open("threads/symbols.txt").read()
symbolslist = symbolslist.replace(" ","").split(",")

print symbolslist

threadlist = []

for u in symbolslist:
    t = Thread(target=th,args=(u,))
    t.start()
    threadlist.append(t)

for b in threadlist:
    b.join()

conn = MySQLdb.connect(host="127.0.0.1",port=3307,user="root",passwd="root",db="stock_data")

for key in gmap.keys():
    print key,gmap[key]
    query = "INSERT INTO tutorial (symbol,last) values("
    query = query+"'"+key+"',"+gmap[key]+")"
    x = conn.cursor()
    x.execute(query)
    conn.commit()
    row = x.fetchall()

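It is almost exactly the same as Chris's example (except that I put the login data directly in the code instead of reading it from an external file, but that is not the issue). I get errors from all of the threads, and they look like this: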

Exception in thread Thread-474:
Traceback (most recent call last):
  File "C:\Python27\lib\threading.py", line 810, in __bootstrap_inner
    self.run()
  File "C:\Python27\lib\threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "threads/threads2.py", line 12, in th
    htmltext = urllib.urlopen(base).read()
  File "C:\Python27\lib\urllib.py", line 87, in urlopen
    return opener.open(url)
  File "C:\Python27\lib\urllib.py", line 208, in open
    return getattr(self, name)(url)
  File "C:\Python27\lib\urllib.py", line 345, in open_http
    h.endheaders(data)
  File "C:\Python27\lib\httplib.py", line 969, in endheaders
    self._send_output(message_body)
  File "C:\Python27\lib\httplib.py", line 829, in _send_output
    self.send(msg)
  File "C:\Python27\lib\httplib.py", line 791, in send
    self.connect()
  File "C:\Python27\lib\httplib.py", line 772, in connect
    self.timeout, self.source_address)
  File "C:\Python27\lib\socket.py", line 571, in create_connection
    raise err
IOError: [Errno socket error] [Errno 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

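As I said, this is just the one error for Thread-474, but my IDE shows the same error for many threads: Thread-441, Thread-390, Thread-391, and so on.

What am I missing? Is it something in the code or in my MySQL server settings? Since I followed everything in Chris's example, it should work.

Can anyone help?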

2 Answers:

Answer 0 (score: 0)

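Your threads are trying to access a website and have nothing to do with the database: the traceback ends in socket.create_connection, reached through urllib.urlopen, and never touches MySQLdb. So the problem is not your database setup (which you have already tried and confirmed to work) but your internet connection.

Are you sure you have network connectivity and that any required proxy is configured correctly?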
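One way to check the network path separately from both the scraper and the database is a plain TCP connection test with an explicit timeout. The sketch below is only an illustration: the helper name can_connect and its default timeout are assumptions, not part of the tutorial, and it is written for Python 3 (the thread itself uses Python 2.7).

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds.

    Hypothetical helper for illustration; it only tests reachability and
    does not send an HTTP request.
    """
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except (IOError, OSError):
        # Covers refused connections, timeouts, and DNS failures.
        return False
    conn.close()
    return True
```

Calling can_connect("finance.yahoo.com", 80) from the same machine would show whether the host is reachable at all. If it returns False, the cause is the network or proxy configuration rather than MySQL. Note that a direct socket test bypasses any HTTP proxy, so a False result on a proxied network does not by itself mean the site is down.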

Answer 1 (score: 0)

It seems I was hitting a socket timeout problem.

I added

import socket  # required for setdefaulttimeout; not imported in the program above

timeout = 10  # seconds
socket.setdefaulttimeout(timeout)

before the function definition, and it works as expected! :)
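The default-timeout fix can be combined with per-thread error handling so that one failed request does not kill its thread with an unhandled traceback. The sketch below also caps how many requests run at once; that cap is an assumption added here (thread names such as Thread-474 suggest hundreds of simultaneous connections, which may contribute to the timeouts) and is not something the tutorial or the answer states. The names fetch_all, fetch, and max_workers are hypothetical, and the code targets Python 3.

```python
import socket
import threading

# Same fix as in the answer: sockets created after this call use a 10 s timeout.
socket.setdefaulttimeout(10)


def fetch_all(symbols, fetch, max_workers=8):
    """Call fetch(symbol) in one thread per symbol, at most max_workers at a time.

    Returns (results, errors): two dicts keyed by symbol. `fetch` is any
    callable supplied by the caller, for example one wrapping urlopen.
    """
    results, errors = {}, {}
    lock = threading.Lock()                  # guards both dicts
    gate = threading.Semaphore(max_workers)  # limits concurrent fetch calls

    def worker(symbol):
        with gate:
            try:
                value = fetch(symbol)
            except (IOError, OSError) as exc:
                # Socket timeouts and connection failures end up here.
                with lock:
                    errors[symbol] = exc
            else:
                with lock:
                    results[symbol] = value

    threads = [threading.Thread(target=worker, args=(s,)) for s in symbols]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, errors
```

With this shape, the symbols that failed are collected in errors and can be retried or logged, while the database loop only iterates over results.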