Question

I am trying to scrape posts from the forum shown below, using Scrapy and XPath:

item['post'] = response.xpath('.//div[@class="post-content"]//p/text()').extract_first().encode('utf-8')

The source HTML:

<div class="post-content" data-post-id="1466409">
                    <p>Hello,<br />
I would like to create an application</p>

But I only get "Hello,".

Any ideas on how to get the following instead?

Hello,\nI would like to create an application

Answer 1

You can use: /p[descendant-or-self::text()]

Answer 2

You can use scrapy shell to test a small piece of HTML:

Create test.html:

<div class="post-content" data-post-id="1466409">
                    <p>Hello,<br />
I would like to create an application</p></div>

Then run scrapy shell ./test.html

>>> ' '.join(response.xpath('//div[@class="post-content"]//p/text()').extract())
'Hello, \nI would like to create an application'
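Note that joining with a space leaves an extra space before the newline. Since the second text node already starts with "\n", joining with an empty string gives exactly the string asked for in the question. A minimal sketch, where the list literal is a stand-in for what `.extract()` returns for the sample HTML and `join_text_nodes` is a hypothetical helper name:

```python
def join_text_nodes(fragments):
    """Concatenate the strings returned by .extract() into one post body."""
    return "".join(fragments)


# Stand-in for:
# response.xpath('//div[@class="post-content"]//p/text()').extract()
fragments = ["Hello,", "\nI would like to create an application"]

post = join_text_nodes(fragments)
print(repr(post))
```

In a spider, the same idea would replace `extract_first()` with `extract()` and pass the resulting list to the helper.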

Or, if you want a single post only, update test.html:

<div class="post-content" data-post-id="1466409">
                    <p>Hello,<br />
I would like to create an application</p></div>

<div class="post-content" data-post-id="1466410">
                    <p>Hello,<br />
I would like to create an application1</p></div>

Run scrapy shell ./test.html again:

>>> ' '.join(response.xpath('//div[@data-post-id="1466409"]//p/text()').extract())

However, I suppose you don't know the data-post-id of each post, so in that case you can do something like this to get the first post:

>>> from bs4 import BeautifulSoup
>>> first_post=response.xpath('//div[@class="post-content"]').extract_first()
>>> alist=BeautifulSoup(first_post, 'html.parser').find_all('p')
>>> ''.join([p.get_text() for p in alist])
'Hello,\nI would like to create an application'

Or, iterate over all posts:

>>> all_posts=response.xpath('//div[@class="post-content"]').extract()
>>> for post in all_posts:
...     alist=BeautifulSoup(post, 'html.parser').find_all('p')
...     ''.join([p.get_text() for p in alist])
... 
'Hello,\nI would like to create an application'
'Hello,\nI would like to create an application1'

Answer 3

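Your p has three child nodes:

a text node containing "Hello,"
a br element
a text node containing "I would like to create an application"

Your selector gets all (two) text nodes in p. Then, with extract_first(), you ask for the first one. So the result contains only "Hello,".

That should come as no surprise.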
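The node structure described above can be inspected directly. This is a small sketch that uses the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy's selectors (an assumption made so the snippet runs without Scrapy; the sample happens to be well-formed XML). ElementTree stores the text that follows an element on that element's `tail` attribute:

```python
import xml.etree.ElementTree as ET

SNIPPET = """<div class="post-content" data-post-id="1466409">
                    <p>Hello,<br />
I would like to create an application</p></div>"""

root = ET.fromstring(SNIPPET)
p = root.find("p")
br = p.find("br")

first_text = p.text    # text node before <br>
second_text = br.tail  # text node after <br>, stored on the br element
all_text = "".join(p.itertext())  # every text node inside p, in document order

print(repr(first_text))
print(repr(second_text))
print(repr(all_text))
```

Taking only the first text node corresponds to `first_text`, while the asker wants `all_text`.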

If you want to get the entire content of p, with the br elements replaced by newlines, you have to do that yourself.
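One possible way to do that yourself is sketched below, using only the standard library's `html.parser`. The class and function names (`ParagraphText`, `paragraph_text`) are hypothetical, and the whitespace handling is a simplification: runs of whitespace become one space, and spaces around the inserted newlines are dropped. Text from several p elements is concatenated without a separator.

```python
import re
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect text inside <p> elements, turning each <br> into a newline."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # greater than 0 while inside a <p>
        self.parts = []  # collected text fragments

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1
        elif tag == "br" and self.depth:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "p" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            # Source line breaks are formatting only; collapse them to a space.
            self.parts.append(re.sub(r"\s+", " ", data))


def paragraph_text(html):
    """Return the text of all <p> elements in html, with <br> as newline."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Drop the spaces left next to the newlines that replaced <br>.
    return re.sub(r" *\n *", "\n", text).strip()


sample = """<div class="post-content" data-post-id="1466409">
                    <p>Hello,<br />
I would like to create an application</p></div>"""

print(repr(paragraph_text(sample)))
```

In a spider, the input could be the HTML string returned by `response.xpath('//div[@class="post-content"]').extract_first()`.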
