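Web spider does not return all results

Time: 2014-04-21 15:24:12

Tags: python mysql sql scrapy

As you can see here, I could not get two different spiders to add their results to a MySQL database automatically. I have now added an if and an elif statement, and they work, but some results are missing: the Bath table used to have 52 rows and now has only 41, and the Bristol table used to have 154 rows and now has only 141. I cannot work out why the results differ.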

Pipelines.py

import sys
import MySQLdb
import MySQLdb.cursors
import hashlib
from scrapy.exceptions import DropItem
from scrapy.http import Request

class TestPipeline(object):

    def __init__(self):
        self.conn = MySQLdb.connect(
            user='user',
            passwd='pwd',
            db='db',
            host='host',
            charset='utf8',
            use_unicode=True
            )
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        try:
            if 'BristolQualification' in item:
                self.cursor.execute("""INSERT INTO Bristol(BristolCountry, BristolQualification) VALUES ('{0}', '{1}')""".format(item['BristolCountry'], "".join([s.encode('utf8') for s in item['BristolQualification']])))
            elif 'BathQualification' in item:
                self.cursor.execute("""INSERT INTO Bath(BathCountry, BathQualification) VALUES ('{0}', '{1}')""".format(item['BathCountry'], "".join([s.encode('utf8') for s in item['BathQualification']])))
            self.conn.commit()
            return item

        except MySQLdb.Error as e:
            print "Error %d: %s" % (e.args[0], e.args[1])
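
The INSERT statements in this pipeline splice the scraped text straight into the SQL string, so a value containing a single quote would end the string literal early and the statement would fail. The following is a minimal sketch of one way to separate the SQL from the values, not the asker's code: `build_insert` is a hypothetical helper name, and the item is treated as a plain mapping. It assumes MySQLdb's `%s` placeholder style and that the connection's `charset='utf8'` setting takes care of encoding.

```python
def build_insert(item):
    """Return (sql, params) for an item, or None if no known field is present.

    Hypothetical helper: table and column names come from a fixed tuple,
    so only the scraped values are passed as bound parameters.
    """
    for table in ("Bristol", "Bath"):  # table names used by the pipeline
        qual_key = table + "Qualification"
        country_key = table + "Country"
        if qual_key in item:
            # %s is a driver placeholder here, not Python string formatting.
            sql = "INSERT INTO {0}({1}, {2}) VALUES (%s, %s)".format(
                table, country_key, qual_key)
            params = (item[country_key], "".join(item[qual_key]))
            return sql, params
    return None
```

Inside `process_item`, the result could then be passed on as `self.cursor.execute(sql, params)`, which lets the driver escape quotes in the values.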

Items.py

from scrapy.item import Item, Field

class QualificationItem(Item):
    BristolQualification = Field()
    BristolCountry = Field()
    BathQualification = Field()
    BathCountry = Field()

Bristol.py

from scrapy.spider import BaseSpider
from project.items import QualificationItem
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request
from urlparse import urljoin

USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0'

class recursiveSpider(BaseSpider):
    name = 'bristol'
    allowed_domains = ['bristol.ac.uk/']
    start_urls = ['http://www.bristol.ac.uk/international/countries/']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        xpath = '//*[@id="all-countries"]/li/ul/li/a/@href'
        a_of_the_link = '//*[@id="all-countries"]/li/ul/li/a/text()'
        for text, link in zip(hxs.select(a_of_the_link).extract(), hxs.select(xpath).extract()):
            yield Request(urljoin(response.url, link),
                          meta={'a_of_the_link': text},
                          headers={'User-Agent': USER_AGENT},
                          callback=self.parse_linkpage,
                          dont_filter=True)

    def parse_linkpage(self, response):
        hxs = HtmlXPathSelector(response)
        item = QualificationItem()
        xpath = """
                //h2[normalize-space(.)="Entry requirements for undergraduate courses"]
                 /following-sibling::p[not(preceding-sibling::h2[normalize-space(.)!="Entry requirements for undergraduate courses"])]
                """
        item['BristolQualification'] = hxs.select(xpath).extract()[1:]
        item['BristolCountry'] = response.meta['a_of_the_link']
        return item

If you look here, a user did try to solve the problem but did not succeed, and I have not heard from them since.

  

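"The errors are caused by unescaped single quotes in the BristolQualification item field (the Bath spider possibly suffers from the same problem), which wreak havoc (e.g. d'Études in the snippet below):"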

That is what he thinks the problem is.
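
The suggested cause can be reproduced in isolation. The sketch below is only an illustration under stated assumptions: an in-memory SQLite database stands in for MySQL so that it runs without a server, the two rows are made-up sample data, and the table has the same two columns as the Bristol table. It shows that a value containing a single quote makes a string-formatted INSERT fail, and that a swallowed error then silently costs a row, which would match the shrinking row counts.

```python
import sqlite3

# In-memory SQLite stands in for MySQL (an assumption so the example is
# self-contained); the quoting problem behaves the same way.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Bristol (BristolCountry TEXT, BristolQualification TEXT)")

# Hypothetical sample rows; the second value contains a single quote.
rows = [
    ("Germany", "Abitur"),
    ("France", "Diplome d'Etudes"),
]

inserted, failed = 0, 0
for country, qualification in rows:
    sql = "INSERT INTO Bristol VALUES ('{0}', '{1}')".format(country, qualification)
    try:
        cur.execute(sql)
        inserted += 1
    except sqlite3.Error:
        # Mirrors the pipeline: the error is caught and the row is lost.
        failed += 1

# With parameter binding the driver handles the quoting.
# sqlite3 uses "?" placeholders; MySQLdb uses "%s" instead.
for country, qualification in rows:
    cur.execute("INSERT INTO Bristol VALUES (?, ?)", (country, qualification))
conn.commit()

total = cur.execute("SELECT COUNT(*) FROM Bristol").fetchone()[0]
```

If this is what happens in the pipeline, the "Error ..." lines printed by the except branch should number the same as the missing rows.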

Can anyone see what the problem is?

0 answers:

No answers