Python - Why is XPath text() broken in Scrapy?

Time: 2018-03-31 13:10:38

Tags: python python-3.x web-scraping scrapy scrapy-spider

I am trying to scrape posts from the forum shown below, using Scrapy and XPath:

item['post'] = response.xpath('.//div[@class="post-content"]//p/text()').extract_first().encode('utf-8')

HTML source:

<div class="post-content" data-post-id="1466409">
                    <p>Hello,<br />
I would like to create an application</p>

But I only get "Hello,".

Any ideas on how to fix this so that I get the following instead?

Hello,\nI would like to create an application

3 answers:

Answer 0 (score: 1)

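You can use: /p[descendant-or-self::text()]

This predicate matches p elements that contain text, so the result is the whole element rather than a single text node.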

Answer 1 (score: 0)

You can use scrapy shell to test a small piece of HTML.

Create test.html:

<div class="post-content" data-post-id="1466409">
                    <p>Hello,<br />
I would like to create an application</p></div>

Then run scrapy shell ./test.html

>>> ' '.join(response.xpath('//div[@class="post-content"]//p/text()').extract())
'Hello, \nI would like to create an application'

Or, if you want the text of just one post, update test.html so that it contains two posts:

<div class="post-content" data-post-id="1466409">
                    <p>Hello,<br />
I would like to create an application</p></div>

<div class="post-content" data-post-id="1466410">
                    <p>Hello,<br />
I would like to create an application1</p></div>  

Run scrapy shell ./test.html again and select the post by its data-post-id:

>>> ' '.join(response.xpath('//div[@data-post-id="1466409"]//p/text()').extract())

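However, I assume you do not know the data-post-id of every post, so in that case you can do something like this to get the first post:

>>> from bs4 import BeautifulSoup
>>> first_post = response.xpath('//div[@class="post-content"]').extract_first()
>>> # Name the parser explicitly; bs4 warns when it has to guess one.
>>> alist = BeautifulSoup(first_post, 'html.parser').find_all('p')
>>> ''.join(p.get_text() for p in alist)
'Hello,\nI would like to create an application'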

Or, iterate over all posts:

>>> all_posts = response.xpath('//div[@class="post-content"]').extract()
>>> for post in all_posts:
...     # One BeautifulSoup document per post; join the text of all its <p> tags.
...     alist = BeautifulSoup(post, 'html.parser').find_all('p')
...     ''.join(p.get_text() for p in alist)
... 
'Hello,\nI would like to create an application'
'Hello,\nI would like to create an application1'

Answer 2 (score: 0)

Your p has three child nodes:

  • a text node containing "Hello,"
  • a br element
  • a text node containing "I would like to create an application"

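Your selector matches all (two) text nodes inside p. Then, with extract_first(), you ask for only the first of them. So it should come as no surprise that the result contains just "Hello,".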

If you want the entire content of the p, with the br elements replaced by newlines, you have to do that yourself.
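One possible way to do it yourself is sketched below using only the standard library. It assumes the paragraph has already been extracted as an HTML string (for example with extract_first() on the p element rather than on its text()). The names ParagraphText and paragraph_text are hypothetical helpers, not part of Scrapy, and stripping source line breaks from each text chunk is a simplification that suits this markup but may not suit every page.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text of an HTML fragment, turning each <br> into a newline."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # <br> and <br /> both arrive here (the latter via handle_startendtag).
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        # Drop line breaks that come only from the layout of the HTML source,
        # so that <br> is the only thing that produces a newline.
        self.parts.append(data.strip("\r\n"))


def paragraph_text(html):
    """Return the text of `html` with <br> elements replaced by newlines."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


p_html = "<p>Hello,<br />\nI would like to create an application</p>"
print(repr(paragraph_text(p_html)))
```

In a spider, the input string could come from something like response.xpath('.//div[@class="post-content"]//p').extract_first(), which returns the serialized p element.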