I am trying to scrape posts from the forum shown below, using Scrapy and XPath:
item['post'] = response.xpath('.//div[@class="post-content"]//p/text()').extract_first().encode('utf-8')
Source:
<div class="post-content" data-post-id="1466409">
<p>Hello,<br />
I would like to create an application</p>
But I only get "Hello,".
Any ideas on how to get the full text instead?
Hello,\nI would like to create an application
Answer 0 (score: 1)
You can use: /p[descendant-or-self::text()]
Answer 1 (score: 0)
You can use scrapy shell to test a small piece of HTML.
Create test.html:
<div class="post-content" data-post-id="1466409">
<p>Hello,<br />
I would like to create an application</p></div>
Then run scrapy shell ./test.html:
>>> ' '.join(response.xpath('//div[@class="post-content"]//p/text()').extract())
'Hello, \nI would like to create an application'
Or, if you only want one particular post, update test.html:
<div class="post-content" data-post-id="1466409">
<p>Hello,<br />
I would like to create an application</p></div>
<div class="post-content" data-post-id="1466410">
<p>Hello,<br />
I would like to create an application1</p></div>
Run scrapy shell ./test.html again:
>>> ' '.join(response.xpath('//div[@data-post-id="1466409"]//p/text()').extract())
However, I assume you don't know the data-post-id of each post, so in that case you can do something like this to get the first post:
>>> from bs4 import BeautifulSoup
>>> first_post=response.xpath('//div[@class="post-content"]').extract_first()
>>> alist=BeautifulSoup(first_post, 'html.parser').find_all('p')
>>> ''.join([p.get_text() for p in alist])
'Hello,\nI would like to create an application'
Or, iterate over all posts:
>>> all_posts=response.xpath('//div[@class="post-content"]').extract()
>>> for post in all_posts:
...     alist=BeautifulSoup(post, 'html.parser').find_all('p')
...     ''.join([p.get_text() for p in alist])
...
'Hello,\nI would like to create an application'
'Hello,\nI would like to create an application1'
Answer 2 (score: 0)
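Your p has three children: a text node ("Hello,"), a br element, and a second text node (the rest of the sentence).
Your selector matches all (two) text nodes in the p. Then, with extract_first(), you ask for only the first one, so the result contains just "Hello,".
If you want the full content of every p, with the br elements replaced by line breaks, you have to do that yourself.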
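One way to "do it yourself" is sketched below. It is not a Scrapy API; it only uses Python's standard-library html.parser, and the names ParagraphText and paragraphs_with_breaks are hypothetical. It assumes you pass in the HTML string of a post, for example the result of extract_first() on the div, and that p elements are not nested.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text of each <p>, turning every <br> into a newline."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # finished paragraph strings
        self._parts = None    # pieces of the <p> being read; None outside a <p>

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <br /> also arrive here by default.
        if tag == "p":
            self._parts = []
        elif tag == "br" and self._parts is not None:
            self._parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "p" and self._parts is not None:
            self.paragraphs.append("".join(self._parts))
            self._parts = None

    def handle_data(self, data):
        if self._parts is not None:
            # Drop the source line break that follows <br /> so it is not doubled.
            self._parts.append(data.strip("\n"))


def paragraphs_with_breaks(html):
    """Return the text of every <p> in html, with <br> rendered as a newline."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


sample = """<div class="post-content" data-post-id="1466409">
<p>Hello,<br />
I would like to create an application</p></div>"""

print(paragraphs_with_breaks(sample))
```

Stripping the literal newline from each text chunk is a design choice for this particular markup, where every br is followed by a line break in the source; other pages may need different whitespace handling.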