Question

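I'm trying to use Scrapy to extract data from this page: https://www.interpol.int/notice/search/woa/1192802

The spider crawls multiple pages, but I've left out the pagination code here to keep things simple. The problem is that the number of table rows I want to scrape varies from page to page.

So I need a way to scrape all of the table data from a page, no matter how many rows it has.

First, I extract all the table rows on the page. Then I create an empty dictionary. Next, I try to loop over each row and put its cell data into the dictionary.

But it doesn't work: it returns an empty file.

Any idea what's wrong?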

# -*- coding: utf-8 -*-
import scrapy


class Test1Spider(scrapy.Spider):
    name = 'test1'
    allowed_domains = ['interpol.int']
    start_urls = ['https://www.interpol.int/notice/search/woa/1192802']

    def parse(self, response):
        table_rows = response.xpath('//*[contains(@class,"col_gauche2_result_datasheet")]//tr').extract()
        data = {}
        for table_row in table_rows:
            data.update({response.xpath('//td[contains(@class, "col1")]/text()').extract(): response.css('//td[contains(@class, "col2")]/text()').extract()})
        yield data

Answer 1

What is this?

response.css('//td[contains(@class, "col2")]/text()').extract()

You are calling the css() method, but you are passing it an XPath expression.

Anyway, here is code that works 100%; I have tested it.

table_rows = response.xpath('//*[contains(@class,"col_gauche2_result_datasheet")]//tr')
data = {}
for table_row in table_rows:
    data[table_row.xpath('td[@class="col1"]/text()').extract_first().strip()] = table_row.xpath('td[@class="col2 strong"]/text()').extract_first().strip()
yield data
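
The key change from the question's code is that each query runs against the current row instead of the whole response (the question also called .extract() on the rows, which turns them into plain strings that can no longer be queried). A minimal offline sketch of that per-row pattern follows. It is not Scrapy code: it swaps in the standard library's xml.etree.ElementTree so it runs without a network connection, and the SAMPLE markup and the rows_to_dict name are made up for illustration. It also skips rows that lack one of the two cells, since calling .strip() on the None that extract_first() returns for a missing cell would raise an error.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the datasheet table (not the real page).
SAMPLE = """
<div class="col_gauche2_result_datasheet">
  <table>
    <tr><td class="col1">Family name:</td><td class="col2 strong">DOE</td></tr>
    <tr><td class="col1">Forename:</td><td class="col2 strong">
        JOHN
    </td></tr>
    <tr><td class="col2 strong">row without a label cell</td></tr>
  </table>
</div>
"""


def rows_to_dict(markup):
    """Build {label: value} from every <tr>, skipping rows that lack either cell."""
    root = ET.fromstring(markup.strip())
    data = {}
    for row in root.iter("tr"):
        # Search relative to this row only, never the whole document.
        label = row.find("td[@class='col1']")
        value = row.find("td[@class='col2 strong']")
        if label is None or value is None:
            continue  # incomplete row: nothing to store
        data[(label.text or "").strip()] = (value.text or "").strip()
    return data


result = rows_to_dict(SAMPLE)
print(result)
```

In a real spider the same guard would apply to the values returned by extract_first() before calling .strip() on them.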

Edit

To remove characters such as \t, \n and \r, use a regular expression.

import re
your_string = re.sub('\\t|\\n|\\r', '', your_string)
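
As a small runnable sketch of that cleanup step, the same substitution can be wrapped in a helper and applied to each cell before it goes into the dictionary. The clean_cell name and the sample string are only assumptions for illustration; a character class is used here in place of the alternation, which should match the same three characters.

```python
import re


def clean_cell(text):
    """Remove tabs, newlines and carriage returns, then trim both ends."""
    return re.sub(r"[\t\n\r]", "", text).strip()


cleaned = clean_cell("\r\n\t\tDOE, John\t\n")
print(cleaned)
```

Note that this deletes the characters outright, so words separated only by a newline would be joined together; replacing with a space instead of an empty string may suit some pages better.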

Answer 2

Try this.

I hope it helps you.

public partial class iseng
{
    public int Id { get; set; }
    public bool hobi1 { get; set; }
    public bool hobi2 { get; set; }
    public bool hobi3 { get; set; }
}

@Html.CheckBoxFor(x => x.hobi1, "Makan")
@Html.CheckBoxFor(x => x.hobi2, "Minum")
@Html.CheckBoxFor(x => x.hobi3, "Tidur")

or use like this

@Html.CheckBox("id", true)

Extracting data from table rows with Scrapy

2 answers: