I'm testing Scrapy and can't figure out how to retrieve plain text, without the markup, when it is nested inside tags. This is the URL I'm testing against: http://www.tripadvisor.com/ShowTopic-g293915-i3686-k8824646-What_s_the_coolest_thing_you_saw_or_did_in_Thailand-Thailand.html
Desired output: the content of each post as a separate element in item['body'].
My code:
import scrapy
from tripadvisor.items import TripadvisorItem

class TripadvisorSpider(scrapy.Spider):
    [...]

    def parse_thread_contents(self, response):
        url = response.url
        item = TripadvisorItem()
        for sel in response.xpath('//div[@class="balance"]'):
            item['body'] = sel.xpath('//div[@class="postBody"]//p').extract()
            yield item
Answer 0 (score: 1)
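You need to get the text() of the p elements. There is also a problem in the loop: you need to iterate over the posts one by one, get each post's body, and collect the results into a list: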
item['body'] = ["".join(post.xpath('.//div[@class="postBody"]/p/text()').extract())
                for post in response.xpath('//div[@class="postcontent"]')]
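Also note that the dot at the beginning of the inner expression matters: it makes the search relative to the current post instead of the whole document.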
Demo:
In [1]: for post in response.xpath('//div[@class="postcontent"]'):
   ...:     print("".join(post.xpath('.//div[@class="postBody"]/p/text()').extract()))
   ...:
What's that memory you'll carry forever with you? Maybe you stayed on a floating hut in Khao Sok Lake, or you washed elephants in a sanctuary, or....I have no idea. Please share if you like, I'd love to hear!
The heat when you you go to for the first time, my blessing ceremony with my husband on Bottle Beach is up there, as is the first time I met him in Samui. Phang Nga Bay on the west coast is stunning and took my breath away, I overnighted on a friend's boat and watched the stars come out. Hong Island was amazing and arriving at Koh Racha before it had hotels on it. Early morning mist on the river at Amphawa whilst looking across to a beautiful temple, the Chao Praya River in Bangkok, the Reclining Buddha at Wat Pho - I could go on and on. : )
First trip to few years back. Not very informed, no smart phone, no google earth....rent a bike, with my wife and we just ride the bike "till the road ends"...ended up at their local uni, watch student going in and out of the uni gate, sat on the road side having a coke. No worries...just me and my wife.Cassnu, pls...go on and on...we dont mind.
...