Question

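I'm trying to get Scrapy to extract the author, date, and post from the forum thread https://bitcointalk.org/index.php?topic=1209137.0 and import them into my items.

The result I want is (with extraneous HTML, which I'll clean up later):

author1, date1, post1

author2, date2, post2

but instead I get: author1,2,3,4 date1,2,3,4 post1,2,3,4

I've searched around and read a bit about changing XPaths from absolute to relative, but I can't seem to get it working. I'm not sure whether that is the root cause, or whether I need to create a pipeline to transform the data?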

************* UPDATE ********** code attached *********************

import scrapy

# BitorgItem is the project's Item class with author, date and post fields (definition not shown).

class Bitorg(scrapy.Spider):
    name = "bitorg"
    allowed_domains = ["bitcointalk.org"]
    start_urls = [
        "https://bitcointalk.org/index.php?topic=1209137.0"
    ]

    def parse(self, response):
        for sel in response.xpath('..//html/body'):
            item = BitorgItem()
            item['author'] = sel.xpath('.//b/a[@title]').extract()
            item['date'] = sel.xpath('.//td[@valign="middle"]/div[@class="smalltext"]').extract()
            item['post'] = sel.xpath('.//div[@class="post"]').extract()
            yield item

Answer 1

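Although the <table>, <tbody>, and <tr> elements have no attributes that are easy to select on, each post has a <td> with the class poster_info.

To get a list of all posts, select on poster_info, then use the XPath .. notation to move up the tree.

posts = response.xpath('//*[@class="poster_info"]/..')

Within each post, select the child elements of interest.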
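Selecting the child elements inside each post could look roughly like the sketch below. To keep it runnable on its own, it uses the standard library's xml.etree.ElementTree rather than Scrapy, and the SAMPLE markup is an invented, heavily simplified stand-in for the real thread page; extract_posts is a hypothetical helper name. The relative paths are the ones from the question's spider.

```python
import xml.etree.ElementTree as ET

# Invented, simplified stand-in for the thread markup (assumption, not the real page).
SAMPLE = """
<html><body><table>
  <tr>
    <td class="poster_info"><b><a title="profile">author1</a></b></td>
    <td valign="middle"><div class="smalltext">date1</div></td>
    <td><div class="post">post1</div></td>
  </tr>
  <tr>
    <td class="poster_info"><b><a title="profile">author2</a></b></td>
    <td valign="middle"><div class="smalltext">date2</div></td>
    <td><div class="post">post2</div></td>
  </tr>
</table></body></html>
"""

def extract_posts(root):
    """Return one dict per post instead of one dict holding every post."""
    rows = []
    # One container per post: find the poster_info cell, then step up with "..".
    for post in root.findall(".//td[@class='poster_info']/.."):
        rows.append({
            # The leading "." keeps each lookup inside the current post.
            "author": post.findtext(".//b/a[@title]"),
            "date": post.findtext(".//td[@valign='middle']/div[@class='smalltext']"),
            "post": post.findtext(".//div[@class='post']"),
        })
    return rows

posts = extract_posts(ET.fromstring(SAMPLE))
for row in posts:
    print(row)
```

In a Scrapy spider the same shape would be a loop over response.xpath('//*[@class="poster_info"]/..'), calling post.xpath('.//b/a[@title]').extract() and so on for each field, and yielding one item per iteration. The leading dot matters: a path starting with // searches the whole document again, which is how every author ends up in a single item.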

Answer 2

As you know, the whole page is just one big div containing small tables, and the XPaths for the authors are:

/html/body/div[2]/form/table[1]/tbody/tr[1]/td/table/tbody/tr/td/table/tbody/tr[1]/td[1]/b/a
/html/body/div[2]/form/table[1]/tbody/tr[5]/td/table/tbody/tr/td/table/tbody/tr[1]/td[1]/b/a
/html/body/div[2]/form/table[1]/tbody/tr[6]/td/table/tbody/tr/td/table/tbody/tr[1]/td[1]/b/a

You can use a loop like this to scrape whatever you need:

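from scrapy.loader import ItemLoader  # XPathItemLoader is the older name for this class

# JustDialItem is the answerer's own item class; substitute your own.
l = ItemLoader(item=JustDialItem(), response=response)
for i in range(1, 10):
    # As posted, all three fields use the author path; swap in the date and post paths as needed.
    l.add_xpath('content1', '//*[@id="bodyarea"]/form/table[1]/tbody/tr[' + str(i) + ']/td/table/tbody/tr/td/table/tbody/tr[1]/td[1]/b/a/text()')
    l.add_xpath('content2', '//*[@id="bodyarea"]/form/table[1]/tbody/tr[' + str(i) + ']/td/table/tbody/tr/td/table/tbody/tr[1]/td[1]/b/a/text()')
    l.add_xpath('content3', '//*[@id="bodyarea"]/form/table[1]/tbody/tr[' + str(i) + ']/td/table/tbody/tr/td/table/tbody/tr[1]/td[1]/b/a/text()')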

You can handle the date and the post in the same way.
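One way to read "the same way" is to keep the row index in a shared prefix and append a path relative to that row. The sketch below only builds the XPath strings. The prefix is taken from the loader calls above and the relative paths from the question's spider; ROW_PREFIX, FIELDS and row_xpath are hypothetical names, and none of these paths has been checked against the live page.

```python
# Row prefix from the loader calls in this answer; {i} is the 1-based row index.
ROW_PREFIX = '//*[@id="bodyarea"]/form/table[1]/tbody/tr[{i}]'

# Relative paths borrowed from the question's spider (assumed, not verified).
FIELDS = {
    "author": '//b/a[@title]',
    "date": '//td[@valign="middle"]/div[@class="smalltext"]',
    "post": '//div[@class="post"]',
}

def row_xpath(i, field):
    """Build the XPath for one field of the i-th row (hypothetical helper)."""
    return ROW_PREFIX.format(i=i) + FIELDS[field]

for name in FIELDS:
    print(name, row_xpath(1, name))
```

Each string can be passed to l.add_xpath(name, row_xpath(i, name)). Two caveats: values added to a single loader accumulate across rows, so creating a fresh loader per row keeps the rows apart; and browsers insert <tbody> elements that may be missing from the raw HTML Scrapy downloads, so paths copied from browser developer tools may need the tbody steps removed.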

Scrapy: difficulty outputting multiple rows, XPath issue?
