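I am using a Scrapy CrawlSpider and trying to parse the crawled pages to pick out some attributes of input tags (type, id, name). Each input type is collected into an item so that it can later be stored in a database: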
Database Table_1
╔════════════════╗
║      text      ║
╠══════╤═════════╣
║ id   │ name    ║
╟──────┼─────────╢
║      │         ║
╟──────┼─────────╢
║      │         ║
╚══════╧═════════╝
The same goes for password and file inputs.
The problem I am facing is that the XPath extracts the whole tag:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
from isa.items import IsaItem

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['testaspnet.vulnweb.com']
    start_urls = ['http://testaspnet.vulnweb.com']
    rules = (
        Rule(SgmlLinkExtractor(allow=('/*' ) ),callback='parse_item'),)

    def parse_item(self, response):
        self.log('%s' % response.url)
        hxs = HtmlXPathSelector(response)
        item=IsaItem()
        # Each query selects whole <input> nodes, so extract() returns full tags
        text_input=hxs.select("//input[(@id or @name) and (@type = 'text' )]").extract()
        pass_input=hxs.select("//input[(@id or @name) and (@type = 'password')]").extract()
        file_input=hxs.select("//input[(@id or @name) and (@type = 'file')]").extract()
        print text_input , pass_input ,file_input
        return item
Output:
me@me-pc:~/isa/isa$ scrapy crawl example.com -L INFO -o file_nfffame.csv -t csv
2012-07-02 12:42:02+0200 [scrapy] INFO: Scrapy 0.14.4 started (bot: isa)
2012-07-02 12:42:02+0200 [example.com] INFO: Spider opened
2012-07-02 12:42:02+0200 [example.com] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
[] [] []
[] [] []
[] [] []
[u'<input name="tbUsername" type="text" id="tbUsername" class="Login">'] [u'<input name="tbPassword" type="password" id="tbPassword" class="Login">'] []
[] [] []
[u'<input name="tbUsername" type="text" id="tbUsername" class="Login">'] [u'<input name="tbPassword" type="password" id="tbPassword" class="Login">'] []
[] [] []
2012-07-02 12:42:08+0200 [example.com] INFO: Closing spider (finished)
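Answer 0 (score: 0)
If I understand correctly, you want to extract the attribute values from the inputs.
Your current XPath gives you the whole node because that is what it asks for. The expression goes as far as the node itself but never steps into one of that node's attributes.
To get the node's id attribute rather than the node itself: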
some/xpath/query/@id
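To make the difference between selecting a node and reading its attributes concrete, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree rather than Scrapy. ElementTree supports only a subset of XPath and has no `/@id` step, so the sketch reads attributes with `.get()` instead. The sample markup and the helper name `input_attributes` are assumptions made for illustration; the markup is modelled on the inputs in the question's output and written as well-formed XML so it can be parsed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup, modelled on the inputs in the question's output.
# Tags are self-closing so that ElementTree can parse the snippet as XML.
SAMPLE = """
<form>
  <input name="tbUsername" type="text" id="tbUsername" class="Login" />
  <input name="tbPassword" type="password" id="tbPassword" class="Login" />
  <input type="text" class="NoIdOrName" />
</form>
"""


def input_attributes(root, input_type):
    """Return (id, name) pairs for <input> elements of the given type
    that carry at least one of the two attributes."""
    pairs = []
    # ElementTree's XPath subset handles the [@type='...'] predicate,
    # but not "or", so the (@id or @name) check is done in Python.
    for node in root.findall(".//input[@type='%s']" % input_type):
        node_id, node_name = node.get("id"), node.get("name")
        if node_id is not None or node_name is not None:
            pairs.append((node_id, node_name))
    return pairs


root = ET.fromstring(SAMPLE)
text_inputs = input_attributes(root, "text")
pass_inputs = input_attributes(root, "password")
file_inputs = input_attributes(root, "file")
print(text_inputs)
print(pass_inputs)
print(file_inputs)
```

Each list now holds plain attribute values instead of serialized tags, which is the shape needed for one database row per input.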
Answer 1 (score: 0)
Use:
//yourCurrentExpression/@id
to get the id attribute.
Use:
//yourCurrentExpression/text()
to get any text-node children of the elements selected by yourCurrentExpression.
Finally, you can combine the two expressions into one:
//yourCurrentExpression/@id | //yourCurrentExpression/text()
This produces a node list whose items are ordered as (id-attribute, text-node)*. In other words, the selected nodes are returned in document order.
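The interleaved ordering can be illustrated with a rough sketch. The standard library's xml.etree.ElementTree supports neither the `|` union operator nor attribute steps, so the code below only emulates what `//li/@id | //li/text()` would return by walking the elements in document order. The sample markup and the helper name `ids_and_text` are hypothetical, and the emulation assumes each element holds a single leading text node and no child elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: elements with both an id attribute and a text child.
SAMPLE = """
<ul>
  <li id="first">alpha</li>
  <li id="second">beta</li>
</ul>
"""


def ids_and_text(root, tag):
    """Emulate `//tag/@id | //tag/text()` for simple elements.

    Elements are visited in document order; within each element the id
    attribute comes before its text, matching XPath document order.
    """
    result = []
    for node in root.iter(tag):
        if node.get("id") is not None:
            result.append(node.get("id"))
        if node.text:
            result.append(node.text)
    return result


merged = ids_and_text(ET.fromstring(SAMPLE), "li")
print(merged)
```

Because the values arrive as one flat list, pairing each id with its text means reading the list two items at a time, which only works when every selected element has both.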