我真的需要帮助来完成这项任务,因为它与我的研究有关,而且我是蟒蛇和scrapy的新手。
* 任务是选择所有输入字段(type = text或密码或文件)并将其(id)存储在后端数据库中,除了此输入所属的页面链接* < / p>
我选择输入字段的代码
def parse_item(self, response):
self.log('%s' % response.url)
hxs = HtmlXPathSelector(response)
item=IsaItem()
item['response_fld']=response.url
item['text_input']=hxs.select("//input[(@id or @name) and (@type = 'text' )]/@id ").extract()
item['pass_input']=hxs.select("//input[(@id or @name) and (@type = 'password')]/@id").extract()
item['file_input']=hxs.select("//input[(@id or @name) and (@type = 'file')]/@id").extract()
return item
数据库管道代码:
class SQLiteStorePipeline(object):
def __init__(self):
self.conn = sqlite3.connect('./project.db')
self.cur = self.conn.cursor()
def process_item(self, item, spider):
self.cur.execute("insert into inputs ( input_name) values(?)" , (item['text_input'][0] ), )
self.cur.execute("insert into inputs ( input_name) values(?)" , (item['pass_input'][0] ,))
self.cur.execute("insert into inputs ( input_name) values(?)" ,(item['file_input'][0] , ))
self.cur.execute("insert into links (link) values(?)", (item['response_fld'][0], ))
self.conn.commit()
return item
但我仍然会收到这样的错误
self.cur.execute("insert into inputs ( input_name) values(?)" , (item['text_input'][0] ), )
exceptions.IndexError: list index out of range
或数据库商店只有第一个字母!!
Database links table
╔════════════════╗
║ links ║
╠════════════════╣
║ id │input ║
╟──────┼─────────╢
║ 1 │ t ║
╟──────┼─────────╢
║ 2 │ t ║
╚══════╧═════════╝
Note it should "tbPassword" or "tbUsername"
输出fron json文件
{"pass_input": ["tbPassword"], "file_input": [], "response_fld": "http://testaspnet.vulnweb.com/Signup.aspx", "text_input": ["tbUsername"]}
{"pass_input": [], "file_input": [], "response_fld": "http://testaspnet.vulnweb.com/default.aspx", "text_input": []}
{"pass_input": ["tbPassword"], "file_input": [], "response_fld": "http://testaspnet.vulnweb.com/login.aspx", "text_input": ["tbUsername"]}
{"pass_input": [], "file_input": [], "response_fld": "http://testaspnet.vulnweb.com/Comments.aspx?id=0", "text_input": []}
答案 0 :(得分:0)
我对这项技术一无所知,但这是我的猜测:
尝试
insert into inputs ( input_name) values(?)" , (item['text_input'] )
代替
insert into inputs ( input_name) values(?)" , (item['text_input'][0] )
。
至于'列表索引超出范围'错误,您的项目似乎是空的,您应该检查。
答案 1 :(得分:0)
您收到IndexError
因为您尝试获取列表中的第一项,有时这是空的。
我会这样做。
蜘蛛:
def parse_item(self, response):
self.log('%s' % response.url)
hxs = HtmlXPathSelector(response)
item = IsaItem()
item['response_fld'] = response.url
res = hxs.select("//input[(@id or @name) and (@type = 'text' )]/@id ").extract()
item['text_input'] = res[0] if res else None # None is default value in case no field found
res = hxs.select("//input[(@id or @name) and (@type = 'password')]/@id").extract()
item['pass_input'] = res[0] if res else None # None is default value in case no field found
res = hxs.select("//input[(@id or @name) and (@type = 'file')]/@id").extract()
item['file_input'] = res[0] if res else None # None is default value in case no field found
return item
管道:
class SQLiteStorePipeline(object):
def __init__(self):
self.conn = sqlite3.connect('./project.db')
self.cur = self.conn.cursor()
def process_item(self, item, spider):
self.cur.execute("insert into inputs ( input_name) values(?)", (item['text_input'],))
self.cur.execute("insert into inputs ( input_name) values(?)", (item['pass_input'],))
self.cur.execute("insert into inputs ( input_name) values(?)", (item['file_input'],))
self.cur.execute("insert into links (link) values(?)", (item['response_fld'],))
self.conn.commit()
return item