Question

我真的需要帮助来完成这项任务，因为它与我的研究有关，而且我是蟒蛇和scrapy的新手。

* 任务是选择所有输入字段（type = text或密码或文件）并将其（id）存储在后端数据库中，除了此输入所属的页面链接* < / p>

我选择输入字段的代码

def parse_item(self, response):
    self.log('%s' % response.url)

    hxs = HtmlXPathSelector(response)
    item=IsaItem()
    item['response_fld']=response.url

    item['text_input']=hxs.select("//input[(@id or @name) and (@type = 'text' )]/@id ").extract()
    item['pass_input']=hxs.select("//input[(@id or @name) and (@type = 'password')]/@id").extract()
    item['file_input']=hxs.select("//input[(@id or @name) and (@type = 'file')]/@id").extract()

    return item

数据库管道代码：

class SQLiteStorePipeline(object):


def __init__(self):
    self.conn = sqlite3.connect('./project.db')
    self.cur = self.conn.cursor()


def process_item(self, item, spider):
    self.cur.execute("insert into inputs ( input_name) values(?)" , (item['text_input'][0] ), )
    self.cur.execute("insert into inputs ( input_name) values(?)" , (item['pass_input'][0]  ,))
    self.cur.execute("insert into inputs ( input_name) values(?)" ,(item['file_input'][0] ,  ))

    self.cur.execute("insert into links (link) values(?)", (item['response_fld'][0], ))

    self.conn.commit()
    return item

但我仍然会收到这样的错误

self.cur.execute("insert into inputs ( input_name) values(?)" , (item['text_input'][0] ), )
exceptions.IndexError: list index out of range

或数据库商店只有第一个字母!!

Database links table 
 ╔════════════════╗
 ║      links     ║ 
 ╠════════════════╣
 ║  id  │input    ║ 
 ╟──────┼─────────╢
 ║    1 │     t   ║ 
 ╟──────┼─────────╢
 ║    2 │     t   ║ 
 ╚══════╧═════════╝
Note it should "tbPassword" or "tbUsername"

输出fron json文件

{"pass_input": ["tbPassword"], "file_input": [], "response_fld":     "http://testaspnet.vulnweb.com/Signup.aspx", "text_input": ["tbUsername"]}
{"pass_input": [], "file_input": [], "response_fld": "http://testaspnet.vulnweb.com/default.aspx", "text_input": []}
{"pass_input": ["tbPassword"], "file_input": [], "response_fld": "http://testaspnet.vulnweb.com/login.aspx", "text_input": ["tbUsername"]}
{"pass_input": [], "file_input": [], "response_fld": "http://testaspnet.vulnweb.com/Comments.aspx?id=0", "text_input": []}

Answer 1

我对这项技术一无所知，但这是我的猜测：

尝试 insert into inputs ( input_name) values(?)" , (item['text_input'] ) 代替 insert into inputs ( input_name) values(?)" , (item['text_input'][0] )。

至于'列表索引超出范围'错误，您的项目似乎是空的，您应该检查。

Answer 2

您收到IndexError因为您尝试获取列表中的第一项，有时这是空的。

我会这样做。

蜘蛛：

def parse_item(self, response):
    self.log('%s' % response.url)

    hxs = HtmlXPathSelector(response)
    item = IsaItem()
    item['response_fld'] = response.url

    res = hxs.select("//input[(@id or @name) and (@type = 'text' )]/@id ").extract()
    item['text_input'] = res[0] if res else None # None is default value in case no field found

    res = hxs.select("//input[(@id or @name) and (@type = 'password')]/@id").extract()
    item['pass_input'] = res[0] if res else None # None is default value in case no field found

    res = hxs.select("//input[(@id or @name) and (@type = 'file')]/@id").extract()
    item['file_input'] = res[0] if res else None # None is default value in case no field found

    return item

管道：

class SQLiteStorePipeline(object):

    def __init__(self):
        self.conn = sqlite3.connect('./project.db')
        self.cur = self.conn.cursor()


    def process_item(self, item, spider):
        self.cur.execute("insert into inputs ( input_name) values(?)", (item['text_input'],))
        self.cur.execute("insert into inputs ( input_name) values(?)", (item['pass_input'],))
        self.cur.execute("insert into inputs ( input_name) values(?)", (item['file_input'],))

        self.cur.execute("insert into links (link) values(?)", (item['response_fld'],))

        self.conn.commit()
        return item

sqlite3数据库错误..我真累了:(

2 个答案: