Question

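Using Scrapy, I only want to get the argument of the onclick function. I am using response.css() to extract the link.

When I apply a regular expression to get only the argument, I get an error (AttributeError: 'list' object has no attribute 're').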

    <table class="table table-striped table-bordered table-hover Tax" >
               <thead>
                  <tr>
                    <th>Sr No.</th>
                    <th>Name</th>
                    <th>Registration No</th>
                    <th>Address</th>
                    <th>Sectors</th>
                  </tr>
               </thead>
               <tbody>
<tr>

    <td>1</td><td> <a href="javascript:void(0)" onclick='show_info("173543");'> ABCD</a></td>
                    <td>Address</td>
                    <td>12345</td>
                    <td>Data Not Found</td>
                  </tr></tbody></table>

I am using Scrapy to scrape the onclick argument:

link_first = response.css(".table.table-striped.table-bordered.table-hover.Tax>tbody>tr>td>a").xpath("./@onclick").extract().re("show_info\((.+?)\)", text)

Desired output: 173543

Answer 1

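extract() returns the text data as a list of strings, and a plain list has no re() method. To match a regular expression against the selection, call re() on the selector itself instead of on the extracted list.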

html = """<table class="table table-striped table-bordered table-hover Tax" >
            <thead>
                <tr>
                    <th>Sr No.</th>
                    <th>Name</th>
                    <th>Registration No</th>
                    <th>Address</th>
                    <th>Sectors</th>
                </tr>
            </thead>
            <tbody>
<tr>

    <td>1</td><td> <a href="javascript:void(0)" onclick='show_info("173543");'> ABCD</a></td>
                    <td>Address</td>
                    <td>12345</td>
                    <td>Data Not Found</td>
                </tr></tbody></table>"""

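from scrapy.selector import Selector

# Build a selector from the HTML string so the example runs outside a spider.
response = Selector(text=html)
# Call re() directly on the selector; a raw string keeps the backslashes intact.
links = response.css(".table.table-striped.table-bordered.table-hover.Tax>tbody>tr>td>a").xpath("./@onclick").re(r"show_info\((.+?)\)")

print(links)

This returns:

['"173543"']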

Hope this helps :)
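Note that the value returned above still carries the double quotes, while the question asks for a bare 173543. The sketch below uses only the standard `re` module to compare the pattern from this answer with a variant that keeps the quotes outside the capture group. The variant pattern is an assumption added for illustration and is not part of the original answer.

```python
import re

# The attribute value exactly as it appears in the question's HTML.
onclick = 'show_info("173543");'

# Pattern from the answer: the group captures everything between the
# parentheses, so the double quotes are part of the result.
with_quotes = re.findall(r'show_info\((.+?)\)', onclick)

# Variant (an assumption, not from the original answer): match the quotes
# outside the group so only the digits are captured.
without_quotes = re.findall(r'show_info\("(.+?)"\)', onclick)

print(with_quotes)     # ['"173543"']
print(without_quotes)  # ['173543']
```

The same variant pattern should be usable with the selector's re() method, since it takes an ordinary regular expression.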

Answer 2

I used XPath contains() to select the right onclick attribute and re_first() to parse it:

link_id = response.xpath('//td/a[contains(@onclick, "show_info")]/@onclick').re_first( r'"([^"]+)"')

Regular expression in Scrapy to get the argument of the onclick function
