Question

I'm trying to scrape a table with Scrapy. The table is built from tr, th, and td elements. This is the structure of the table:

<table class="project-table">
<tr>
    <th>Price Per Sqft from</th>
    <td>AED 880</td>
</tr>
<tr>
    <th>Type</th>
    <td class="project-typess">
    <a href="https://dxboffplan.com/new/apartments-for-sale-dubai/">Apartments</a>
    </td>
</tr>

As you can see, some td elements contain only text, while others contain an a element. This is my code so far:

def parse(self, response):
    # get the urls of each property
    urls = response.css('div.property-listing > a::attr(href)').extract()
    # for each property make a request to get the details of each property
    for url in urls:
        yield scrapy.Request(url = url , callback = self.parse_details )
    # go and get the next link for the next property
    next_page = response.css('div.property-listing > a::attr(href)').extract_first()
    # to get the details of the property we go through a life cycle
    yield scrapy.Request(url = next_page , callback = self.parse )

def parse_details(self , response):
    for item in response.css('table.project-table> tr '):
        var = DxbItem()
        var['item']             = item.css('th::text').extract()[0]
        var['value']            = item.css('td::text').extract()[0]
        # i've tried everything i know but nothing works 
        if not var['value']:
            var['value']        = item.css('td>a::text').extract()[0]
        yield var

I need to get var['value'], but I get "IndexError: list index out of range". I also tried this:

    for item in response.css('table.project-table> tr '):
        var = DxbItem()
        var['item']             = item.css('th::text').extract()[0]
        a                       = item.css('td::text').extract()[0] # 
        b                       = item.css('td>a::text').extract()[0]
        var['value']            = a + b # to concatenate 2 lists 
        yield var

Answer 1

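You can look up the extract_first() method in the Scrapy documentation. It says:

"However, using .extract_first() avoids an IndexError and returns None when it doesn't find any element matching the selection."

In your code, .extract()[0] raises the IndexError whenever the selector matches nothing, because indexing an empty list fails.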
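
Here is a minimal sketch of how parse_details could be rewritten along those lines. It is not tested against the real site and makes a few assumptions: first_non_blank is a hypothetical helper (not part of Scrapy), a plain dict stands in for DxbItem so the snippet is self-contained, and the value is read with the XPath td//text() so that both plain-text cells and cells wrapping an a element are covered. Note that in the "Type" row shown in the question, the td's own text appears to be only whitespace, so a check like "if not var['value']" would likely not trigger there; picking the first non-blank text node sidesteps that.

```python
def first_non_blank(texts):
    """Return the first string that is not empty or whitespace-only
    (stripped), or None if there is no such string.

    Hypothetical helper for this sketch, not a Scrapy API.
    """
    for text in texts:
        if text is not None and text.strip():
            return text.strip()
    return None


def parse_details(self, response):
    # Meant to replace the method of the same name on the spider class.
    for row in response.css('table.project-table tr'):
        # extract_first() returns None instead of raising IndexError
        # when the row has no <th> text.
        label = row.css('th::text').extract_first()

        # All text nodes under the <td>, whether they sit directly in
        # the cell or inside a nested <a>.
        value = first_non_blank(row.xpath('td//text()').extract())

        if label is None or value is None:
            continue  # skip rows that lack a label or a value

        # Assumption: a dict is used here in place of DxbItem.
        yield {'item': label.strip(), 'value': value}
```

With the two rows from the question, this should yield one item for "Price Per Sqft from" and one for "Type". If you prefer to keep DxbItem, build it from label and value in place of the dict.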
