Question

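I am new to the Scrapy library and I am struggling with my spider. I am trying to scrape data from this website: https://murderpedia.org/male.A/index.A.htm

What I want to do is follow every link on the page and, on each linked page, scrape the image as well as the text [rows 3-11].

Any help here would be greatly appreciated.

Here is my code: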

import re

import scrapy
from scrapy import Request
from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor


BASE_URL = 'http://murderpedia.org/'
PROTOCOL = 'https:'


class SerialKillerItem(scrapy.Item):

    name = scrapy.Field()
    bio = scrapy.Field()
    images = scrapy.Field()
    link = scrapy.Field()
    image_urls = scrapy.Field()
    bio_image = scrapy.Field() 

    classification = scrapy.Field()
    characteristics = scrapy.Field()
    number_of_victims = scrapy.Field()
    date_of_murders = scrapy.Field()
    date_of_birth = scrapy.Field()
    victims_profile = scrapy.Field()
    method_of_murder = scrapy.Field()
    location = scrapy.Field()
    status = scrapy.Field()


class SerialKillerBio(scrapy.Spider): 

    name = 'serial_killer_bio'
    start_urls = ['http://murderpedia.org/male.A/index.A.htm']

    def parse(self, response):

        images = response.css(
            "#AutoNumber3 > tbody > tr:nth-child(2) > td > font:nth-child(1)"
            " > div > center > table:nth-child(2) > tbody > tr > td > font"
            " > div > table > tbody > tr > td:nth-child(2) > p > img::attr(src)"
        ).extract_first()

        for row in response.css('#table4 > tbody'): 

            text = {
            'Classification': row.css('tr:nth-child(3) ::text').extract_first(),
            'Characteristics': row.css('tr:nth-child(4) ::text').extract_first(),
            'Number of Victims': row.css('tr:nth-child(5) ::text').extract_first(),
            'Date of Murders': row.css('tr:nth-child(6) ::text').extract_first(),
            'Date of Birth': row.css('tr:nth-child(7) ::text').extract_first(),
            'Victims Profile': row.css('tr:nth-child(8) ::text').extract_first(),
            'Method of Murder': row.css('tr:nth-child(9) ::text').extract_first(),
            'Location': row.css('tr:nth-child(10) ::text').extract_first(),
            'Status': row.css('tr:nth-child(11) ::text').extract_first()}

            text2 = ''.join(text) 

            print(text2)

            if images:

                yield {'text2':
                SerialKillerItem(classification=text['Classification'],
                        characteristics=text['Characteristics'],
                        number_of_victims=text['Number of Victims'],
                        date_of_murders=text['Date of Murders'],
                        date_of_birth=text['Date of Birth'],
                        victims_profile=text['Victims Profile'],
                        method_of_murder=text['Method of Murder'],
                        location=text['Location'],
                        status=text['Status']),
                        'image_urls': [PROTOCOL + images][:10]}

            else:

                yield {'text2':
                SerialKillerItem(classification=text['Classification'],
                        characteristics=text['Characteristics'],
                        number_of_victims=text['Number of Victims'],
                        date_of_murders=text['Date of Murders'],
                        date_of_birth=text['Date of Birth'],
                        victims_profile=text['Victims Profile'],
                        method_of_murder=text['Method of Murder'],
                        location=text['Location'],
                        status=text['Status']), 'image_urls': []}

                for next_page in response.css(
                        '#table2 > tbody > tr:nth-child(2) > td > font:nth-child(1)'
                        ' > div > table > tbody > tr > td:nth-child(2) > p > font'
                        ' > font > a::attr(href)').extract():

                    print(BASE_URL + next_page)
                    yield Request(BASE_URL + next_page,
                                  callback=self.parse)

Here is the crawl log:

2018-10-24 21:11:04 [scrapy.utils.log] INFO: Scrapy 1.5.1 started 
(bot: serial_killers)
2018-10-24 21:11:04 [scrapy.utils.log] INFO: Versions: lxml 4.2.3.0, 
libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 
18.9.0, Python 3.6.5 (default, Apr 25 2018, 14:22:56) - [GCC 4.2.1 
Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)], pyOpenSSL 18.0.0 
(OpenSSL 1.1.0h  27 Mar 2018), cryptography 2.2.2, Platform Darwin- 
15.2.0-x86_64-i386-64bit
2018-10-24 21:12:19 [scrapy.utils.log] INFO: Scrapy 1.5.1 started 
(bot: serial_killers)
2018-10-24 21:12:19 [scrapy.utils.log] INFO: Versions: lxml 4.2.3.0, 
libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 
18.9.0, Python 3.6.5 (default, Apr 25 2018, 14:22:56) - [GCC 4.2.1 
Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)], pyOpenSSL 18.0.0 
(OpenSSL 1.1.0h  27 Mar 2018), cryptography 2.2.2, Platform Darwin- 
15.2.0-x86_64-i386-64bit
2018-10-24 21:12:19 [scrapy.crawler] INFO: Overridden settings: 
{'BOT_NAME': 'serial_killers', 'FEED_EXPORT_ENCODING': 'utf-8', 
'HTTPCACHE_ENABLED': True, 'LOG_FILE': 'output.log', 
'NEWSPIDER_MODULE': 'serial_killers.spiders', 'ROBOTSTXT_OBEY': True, 
'SPIDER_MODULES': ['serial_killers.spiders']}
2018-10-24 21:12:19 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2018-10-24 21:12:19 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats',
 'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware']
2018-10-24 21:12:19 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-10-24 21:12:19 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy.pipelines.images.ImagesPipeline']
2018-10-24 21:12:19 [scrapy.core.engine] INFO: Spider opened
2018-10-24 21:12:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-10-24 21:12:19 [scrapy.extensions.httpcache] DEBUG: Using filesystem cache storage in /Users/app_10/serial_killers/.scrapy/httpcache
2018-10-24 21:12:19 [scrapy.extensions.telnet] DEBUG: Telnet console 
listening on 127.0.0.1:6023
2018-10-24 21:12:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET 
http://murderpedia.org/robots.txt> (referer: None) ['cached']
2018-10-24 21:12:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET 
http://murderpedia.org/male.A/index.A.htm> (referer: None) ['cached']
2018-10-24 21:12:19 [scrapy.core.engine] INFO: Closing spider 
(finished)
2018-10-24 21:12:19 [scrapy.statscollectors] INFO: Dumping Scrapy 
stats:
{'downloader/request_bytes': 456,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 29306,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 10, 25, 1, 12, 19, 569830),
 'httpcache/hit': 2,
 'log_count/DEBUG': 4,
 'log_count/INFO': 7,
 'memusage/max': 47525888,
 'memusage/startup': 47525888,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 10, 25, 1, 12, 19, 415905)}
2018-10-24 21:12:19 [scrapy.core.engine] INFO: Spider closed 
(finished)

Answer 1

It looks like your crawler is not chaining its requests correctly.

The crawl logic you want is:

1. Go to A listing page
2. Go to every listed person
3. Parse html of every person

Right now your code is missing step 2.
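
Step 2 needs an absolute URL for every listed person. As a rough sketch, the standard library's urljoin resolves an href against the page it was found on; Scrapy's response.urljoin behaves much the same way, using the response's own URL as the base. The href values and the helper name absolute_urls below are made up for illustration and are not taken from the site:

```python
from urllib.parse import urljoin

LISTING_URL = 'http://murderpedia.org/male.A/index.A.htm'

# Hypothetical href values, for illustration only.
hrefs = [
    'a/example-person.htm',              # relative to the listing page
    '/male.B/index.B.htm',               # relative to the site root
    'http://murderpedia.org/index.htm',  # already absolute
]


def absolute_urls(page_url, hrefs):
    """Resolve each href against the URL of the page that contained it."""
    return [urljoin(page_url, href) for href in hrefs]


for url in absolute_urls(LISTING_URL, hrefs):
    print(url)
```

Plain concatenation such as BASE_URL + next_page only gives the right address when every href is relative to the site root, which is why resolving against the current page is the safer default.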

Let's try this:

from scrapy import Request, Spider


class MySpider(Spider):
    name = 'corn-flake-killers'
    start_urls = ['http://murderpedia.org/male.A/index.A.htm']

    def parse(self, response):
        # Find the listing table by locating a cell whose text contains
        # "Victims" and then walking up the tree to the enclosing table.
        table = response.xpath('//td[contains(font//font/text(),"Victims")]/../..')
        # Collect every URL inside that table; the leading "." keeps the
        # search relative to the table instead of the whole document.
        urls = table.xpath('.//a/@href').extract()
        for url in urls:
            # Download each person's page and hand it to parse_person.
            yield Request(response.urljoin(url), self.parse_person)

    def parse_person(self, response):
        item = {}
        # Parse the person's HTML here.
        yield item
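
One possible way to fill in parse_person is sketched below. It assumes the details table on each person's page has one row per field, that each row reads like "Label: value", and that rows 3 to 11 appear in the order given in the question (Classification through Status). The helper name rows_to_item and both XPath expressions are illustrative guesses and have not been checked against the live site. The selectors deliberately leave out tbody: browsers add that element when they render a table, but it is often absent from the raw HTML that Scrapy downloads, so selectors copied from browser developer tools may match nothing.

```python
# Field names for table rows 3-11, in the order listed in the question.
FIELDS = [
    'classification', 'characteristics', 'number_of_victims',
    'date_of_murders', 'date_of_birth', 'victims_profile',
    'method_of_murder', 'location', 'status',
]


def rows_to_item(rows):
    """Map the text of table rows 3-11 (1-based) onto the item fields.

    `rows` holds one string per table row; each string is assumed to
    look like "Label: value".
    """
    item = {}
    for field, row in zip(FIELDS, rows[2:11]):
        # Keep what follows the first colon; fall back to the whole row.
        _, sep, value = row.partition(':')
        item[field] = (value if sep else row).strip()
    return item


def parse_person(self, response):
    """Sketch of a replacement for the stub callback in the spider."""
    # Assumption: '#table4' is the id used in the question; no 'tbody' here.
    rows = [
        ' '.join(' '.join(tr.xpath('.//text()').getall()).split())
        for tr in response.xpath('//table[@id="table4"]//tr')
    ]
    item = rows_to_item(rows)
    item['link'] = response.url
    # Placeholder selector: takes the first image on the page and should
    # be narrowed once the real markup has been inspected.
    src = response.xpath('//img/@src').get()
    item['image_urls'] = [response.urljoin(src)] if src else []
    yield item
```

Keeping the row-to-field mapping in a plain function means it can be checked with a list of strings before the spider runs, and `scrapy shell <url>` is a convenient place to try the selectors against a real page.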
