Hello, I'm trying to write XPath expressions to extract the title and link text from the cells with class listCell. I believe I have it right, since I get no errors, but when I export to a CSV file there are no results in the output file. I've also tested my Scrapy spider against other sites such as Amazon and it works fine there, just not on this site. Please help!!
def parse(self, response):
    self.log("\n\n\n We got data! \n\n\n")
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//form[@id=\'listForm\']/table/tbody/tr')
    items = []
    for site in sites:
        item = CarrierItem()
        item['title'] = site.select('.//td[@class=\'listCell\']/a/text()').extract()
        item['link'] = site.select('.//td[@class=\'listCell\']/a/@href').extract()
        items.append(item)
    return items
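One way to make an empty selection visible in the crawl log is to log how many rows the outer XPath matched before looping. A minimal debugging sketch, assuming the same Scrapy 0.16 HtmlXPathSelector API as the code above:

    sites = hxs.select('//form[@id=\'listForm\']/table/tbody/tr')
    # If this logs 0, the selector matched nothing in the HTML Scrapy received,
    # so the for-loop never runs and no items ever reach the CSV exporter.
    self.log("listForm rows matched: %d" % len(sites))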
Here is my HTML. Could it be failing because there is JavaScript in the HTML?
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title> Carrier IQ DIS 2.4 :: All Devices</title>
<script type="text/javascript" src="/dis/js/main.js">
<script type="text/javascript" src="/dis/js/validate.js">
<link rel="stylesheet" type="text/css" href="/dis/css/portal.css">
<link rel="stylesheet" type="text/css" href="/dis/css/style.css">
<script type="text/javascript">
....
<form id="listForm" name="listForm" method="POST" action="">
<table>
<thead>
<tbody>
<tr>
<td class="crt">1</td>
<td class="listCell" align="center">
<a href="/dis/packages.jsp?view=list&show=perdevice&device_gid=3651746C4173775343535452414567746D75643855673D3D53564A6151624D41716D534C68395A6337634E2F62413D3D&hwdid=probe0&mdn=6505550000&subscrbid=6505550000&maxlength=100">6505550000</a>
</td>
<td class="listCell" align="center">
<a href="/dis/packages.jsp?view=list&show=perdevice&device_gid=3651746C4173775343535452414567746D75643855673D3D53564A6151624D41716D534C68395A6337634E2F62413D3D&hwdid=probe0&subscrbid=6505550000&mdn=6505550000&maxlength=100">probe0</a>
</td>
<td class="listCell" align="center">
<td class="listCell" align="center">
<td class="cell" align="center">2013-07-01 13:39:38.820</td>
<td class="cell" align="left">1 - SMS_PullRequest_CS</td>
<td class="listCell" align="right">
<td class="listCell" align="center">
<td class="listCell" align="center">
</tr>
</tbody>
</table>
</form>
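Regarding the JavaScript concern above: one way to check whether the markup Scrapy downloads matches what the browser's inspector shows is to dump the raw response body from inside parse() and search it for the expected markers. A minimal sketch (Scrapy 0.16 / Python 2, where response.body is a plain str; the output filename is arbitrary):

def parse(self, response):
    # Save the HTML exactly as Scrapy received it, to compare against the
    # browser's "inspect element" view (which can include browser-generated
    # elements such as tbody).
    with open('dis_response.html', 'wb') as f:
        f.write(response.body)
    self.log("'tbody' in served HTML: %s" % ('tbody' in response.body))
    self.log("'listCell' in served HTML: %s" % ('listCell' in response.body))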
Output:
C:\Users\ye831c\Documents\Big Data\Scrapy\carrier>scrapy crawl dis -o iqDis.csv
-t csv
2013-07-01 10:50:18-0500 [scrapy] INFO: Scrapy 0.16.5 started (bot: carrier)
2013-07-01 10:50:18-0500 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogSt
ats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled downloader middlewares: HttpAut
hMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, De
faultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMi
ddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMi
ddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddle
ware
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Enabled item pipelines:
2013-07-01 10:50:19-0500 [dis] INFO: Spider opened
2013-07-01 10:50:19-0500 [dis] INFO: Crawled 0 pages (at 0 pages/min), scraped 0
items (at 0 items/min)
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:602
3
2013-07-01 10:50:19-0500 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-01 10:50:19-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.la
bs.att.com:8080/dis/login.jsp> (referer: None)
2013-07-01 10:50:19-0500 [dis] DEBUG: Redirecting (302) to <GET https://qvpweb01
.ciq.labs.att.com:8080/dis/> from <POST https://qvpweb01.ciq.labs.att.com:8080/d
is/login>
2013-07-01 10:50:20-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.la
bs.att.com:8080/dis/> (referer: https://qvpweb01.ciq.labs.att.com:8080/dis/login
.jsp)
2013-07-01 10:50:20-0500 [dis] DEBUG:
Successfully logged in. Let's start crawling!
2013-07-01 10:50:21-0500 [dis] DEBUG: Crawled (200) <GET https://qvpweb01.ciq.la
bs.att.com:8080/dis/> (referer: https://qvpweb01.ciq.labs.att.com:8080/dis/)
2013-07-01 10:50:21-0500 [dis] DEBUG:
We got data!
2013-07-01 10:50:21-0500 [dis] INFO: Closing spider (finished)
2013-07-01 10:50:21-0500 [dis] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1382,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 3,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 147888,
'downloader/response_count': 4,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 7, 1, 15, 50, 21, 221000),
'log_count/DEBUG': 12,
'log_count/INFO': 4,
'request_depth_max': 2,
'response_received_count': 3,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2013, 7, 1, 15, 50, 19, 42000)}
2013-07-01 10:50:21-0500 [dis] INFO: Spider closed (finished)
Answer (score: 0)
Try simplifying your XPath:
sites = hxs.select('//form[@id="listForm"]//tr')
since the tbody element is (in some cases) not present in the HTML served by the site, but is generated by your browser.
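For reference, a minimal revision of parse() along those lines might look like this (untested against the site; it assumes the same CarrierItem fields and the Scrapy 0.16 HtmlXPathSelector API used in the question):

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # No tbody in the path: match the rows wherever they sit inside the form's table.
    sites = hxs.select('//form[@id="listForm"]//tr')
    items = []
    for site in sites:
        item = CarrierItem()
        item['title'] = site.select('.//td[@class="listCell"]/a/text()').extract()
        item['link'] = site.select('.//td[@class="listCell"]/a/@href').extract()
        # Skip rows that contain no listCell links (e.g. the header row).
        if item['title'] or item['link']:
            items.append(item)
    return items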