How do I use Scrapy to extract data from a database-backed web page?

Posted: 2013-06-28 19:24:48

Tags: python database web-scraping screen-scraping scrapy

Hi, I'm using Scrapy to access our intranet site and do some scraping. Everything seems to be working and I can reach the site, but when I export the data to a CSV file, the file comes out empty and I get no errors when I run it. Every column in the HTML (Directory Number, Equipment ID, Subscriber ID, etc.) has data, so how do I get that data with Scrapy?

from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from carrier.items import CarrierItem

class CarrierSpider(InitSpider):
    name = 'dis'
    allowed_domains = ['qvpweb01.ciq.labs.att.com']
    login_page = 'https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp'
    start_urls = ["https://qvpweb01.ciq.labs.att.com:8080/dis/"]

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
                formdata={'txtUserName': 'xxxx', 'txtPassword': 'secret'},
                callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are successfully logged in."""
        if "logout" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            # Now the crawling can begin..
            return self.initialized()
        else:
            self.log("\n\n\nFailed, Bad password :(\n\n\n")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse(self, response):
        self.log("\n\n\n We got data! \n\n\n")
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//form[@id=\'listForm\']/table/')
        items = []
        for site in sites:
            item = CarrierItem()
            item['title'] = site.select('thead/th/a/text()').extract()
            item['link'] = site.select('thead/th/a/@href').extract()
            items.append(item)
        return items
HTML from the page:

<table width="100%" cellspacing="0" cellpadding="2" border="0">
<table width="100%" cellspacing="0" cellpadding="0" border="0">
<tbody>
<tr>
<td valign="top" align="left">
<html>
<html>
<div class="clr"></div>
<table cellspacing="1" cellpadding="0" border="0">
<tbody>
<tr>
<tr>
<td align="left" colspan="2">
<div class="list">
<form id="listForm" name="listForm" method="POST" action="">
<table>
<thead>
<th class="first"></th>
<th>
<a href="/dis/?&orderby=mdn&order=asc">Directory Number</a>
</th>
<th>
<a href="/dis/?&orderby=hardwareId&order=asc">Equipment ID</a>
</th>
<th>
<a href="/dis/?&orderby=subscriberId&order=asc">Subscriber ID</a>
</th>
<th class="arrow_up">
<th>
<a href="/dis/?&orderby=sessUpldTime&order=asc">Session Upload Time</a>
</th>
<th>
<a href="/dis/?&orderby=upldRsn&order=asc">Upload Reason</a>
</th>
<th>
<a href="/dis/?&orderby=prof&order=asc">Profile ID</a>
</th>
<th>
<img width="1" height="1" src="/dis/img/spacer.gif" alt="">
</th>
<th class="last">
<img width="1" height="1" src="/dis/img/spacer.gif" alt="">
</th>
</thead>
<tbody>
</table>
</form>
</div>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
Output when running the crawler:

C:\Users\ye831c\Documents\Big Data\Scrapy\carrier>scrapy crawl dis -o iqDis.csv -t csv
2013-06-28 14:10:41-0500 [scrapy] INFO: Scrapy 0.16.5 started (bot: carrier)
2013-06-28 14:10:41-0500 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-06-28 14:10:42-0500 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-06-28 14:10:42-0500 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-06-28 14:10:42-0500 [scrapy] DEBUG: Enabled item pipelines:
2013-06-28 14:10:42-0500 [dis] INFO: Spider opened
2013-06-28 14:10:42-0500 [dis] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-06-28 14:10:42-0500 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-06-28 14:10:42-0500 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-06-28 14:10:42-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp> (referer: None)
2013-06-28 14:10:42-0500 [dis] DEBUG: Redirecting (302) to <GET https://qvpweb01.ciq.labs.att.com:8080/dis/> from <POST https://qvpweb01.ciq.labs.att.com:8080/dis/login>
2013-06-28 14:10:43-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.labs.att.com:8080/dis/> (referer: https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp)
2013-06-28 14:10:43-0500 [dis] DEBUG:


    Successfully logged in. Let's start crawling!



2013-06-28 14:10:44-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.labs.att.com:8080/dis/> (referer: https://qvpweb01.ciq.labs.att.com:8080/dis/)
2013-06-28 14:10:44-0500 [dis] DEBUG:


     We got data!



2013-06-28 14:10:44-0500 [dis] INFO: Closing spider (finished)
2013-06-28 14:10:44-0500 [dis] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 1382,
     'downloader/request_count': 4,
     'downloader/request_method_count/GET': 3,
     'downloader/request_method_count/POST': 1,
     'downloader/response_bytes': 146604,
     'downloader/response_count': 4,
     'downloader/response_status_count/200': 3,
     'downloader/response_status_count/302': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2013, 6, 28, 19, 10, 44, 469000),
     'log_count/DEBUG': 12,
     'log_count/INFO': 4,
     'request_depth_max': 2,
     'response_received_count': 3,
     'scheduler/dequeued': 4,
     'scheduler/dequeued/memory': 4,
     'scheduler/enqueued': 4,
     'scheduler/enqueued/memory': 4,
     'start_time': datetime.datetime(2013, 6, 28, 19, 10, 42, 15000)}
2013-06-28 14:10:44-0500 [dis] INFO: Spider closed (finished)

C:\Users\ye831c\Documents\Big Data\Scrapy\carrier>

1 Answer:

Answer 0 (score: 1)

The problem is in your XPath expressions: they should be relative (.//):

item['title'] = site.select('.//thead/th/a/text()').extract()
item['link'] = site.select('.//thead/th/a/@href').extract()
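
For context, here is a sketch of how the parse() callback from the question might look with those relative expressions. It keeps the HtmlXPathSelector approach from the question and assumes the outer expression is trimmed to //form[@id='listForm']/table (without the trailing slash, which is not valid XPath):

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # outer selection: the table inside the listForm form
    sites = hxs.select('//form[@id="listForm"]/table')
    items = []
    for site in sites:
        item = CarrierItem()
        # .// scopes the search to the current `site` node instead of
        # restarting from the document root
        item['title'] = site.select('.//thead/th/a/text()').extract()
        item['link'] = site.select('.//thead/th/a/@href').extract()
        items.append(item)
    return items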

UPDATE:

After discussing the problem in chat, it turned out that the page Scrapy receives is actually XML, which is transformed with JavaScript after the page loads in a browser.

This is what helped parse the XML and get the necessary data:

def parse(self, response):
    # requires: from scrapy.selector import XmlXPathSelector
    xhs = XmlXPathSelector(response)

    columns = xhs.select('//table[3]/header/column')
    for column in columns:
        item = CarrierItem()
        item['title'] = column.select('.//text()').extract()
        item['link'] = column.select('.//@uri').extract()
        yield item
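
To check what the server really returns (rendered HTML vs. raw XML) and to try XPath expressions interactively, the scrapy shell is handy. A quick session might look like this; the URL and element names are taken from this question, and since the site requires a login the unauthenticated shell response may differ from what the spider sees after logging in:

scrapy shell "https://qvpweb01.ciq.labs.att.com:8080/dis/"
>>> from scrapy.selector import XmlXPathSelector
>>> xxs = XmlXPathSelector(response)
>>> xxs.select('//table[3]/header/column/text()').extract()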