I'm trying to scrape a list of all the hotels in San Francisco from: http://www.tripadvisor.com/Hotels-g60713-San_Francisco_California-Hotels.html
Each "next page" of hotels has a distinct URL:
Page 2 is: /Hotels-g60713-oa30-San_Francisco_California-Hotels.html
Page 3 is: /Hotels-g60713-oa60-San_Francisco_California-Hotels.html
Page 4 is: /Hotels-g60713-oa90-San_Francisco_California-Hotels.html
And so on...
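Since the offset grows by 30 per page, the listing URLs can be generated directly instead of discovered by crawling. A minimal sketch, assuming the `oaN` offset pattern shown above holds for every page:

```python
# Sketch: TripAdvisor paginates with an "oaN" offset segment in the URL,
# 30 hotels per page. Assuming that pattern, the page URLs can be built
# up front and fed to a spider as start_urls.
BASE = "http://www.tripadvisor.com/Hotels-g60713-{}San_Francisco_California-Hotels.html"

def page_urls(num_pages):
    """Return the listing URLs for the first num_pages pages."""
    urls = [BASE.format("")]  # page 1 has no offset segment
    for page in range(2, num_pages + 1):
        offset = (page - 1) * 30  # page 2 -> oa30, page 3 -> oa60, ...
        urls.append(BASE.format("oa{}-".format(offset)))
    return urls
```

`page_urls(4)` yields the four URLs listed above, so the spider never has to rely on a "next" link being present.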
My code so far:
import beatSoup_test
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
class TriAdvSpider(CrawlSpider):
    name = "tripAdv"
    allowed_domains = ["tripadvisor.com"]
    start_urls = [
        "http://www.tripadvisor.com/Hotels-g60713-San_Francisco_California-Hotels.html"
    ]
    rules = (
        # Note: the dot must be escaped, otherwise it matches any character
        Rule(SgmlLinkExtractor(allow=r'-\w+\.html$'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        beatSoup_test.getHotels(response.body_as_unicode())
where beatSoup_test is my parsing function that uses BeautifulSoup. Thanks!
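The beatSoup_test module itself is not shown in the question. A hypothetical stand-in for its getHotels function might look like this; the `a.property_title` selector is an assumption, not the asker's actual code:

```python
# Hypothetical sketch of the beatSoup_test.getHotels helper referenced above.
# The CSS selector used here is assumed; the real page markup may differ.
from bs4 import BeautifulSoup

def getHotels(html):
    """Extract hotel names from a TripAdvisor listing page."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumed: each hotel name is an anchor with class "property_title".
    return [a.get_text(strip=True) for a in soup.select("a.property_title")]
```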
Answer 0 (score: 1)
If you want to scrape data from any page, use XPath; that way you can extract anything on the page. And use Items to store the scraped data, so you can scrape as many pages as you want.
Here's an example of how to use it:
sites = Selector(text=response.body).xpath('//div[contains(@id, "identity")]//section/div/div/h3/a/text()')
items = myspiderBotItem()
items['title'] = sites.extract()  # `sites` already selects the text nodes
Like this:
class TriAdvSpider(CrawlSpider):
    name = "tripAdv"
    allowed_domains = ["tripadvisor.com"]
    start_urls = [
        "http://www.tripadvisor.com/Hotels-g60713-San_Francisco_California-Hotels.html"
    ]
    rules = (
        Rule(SgmlLinkExtractor(allow=r'-\w+\.html$'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # beatSoup_test.getHotels(response.body_as_unicode())
        l = XPathItemLoader(item=TriAdvItem(), response=response)
        for i in range(1, 8):
            l.add_xpath('day', '//*[@id="super-container"]/div/div[1]/div[2]/div[2]/div[1]/table/tbody/tr[' + str(i) + ']/th[@scope="row"]/text()')
            l.add_xpath('timings1', '//*[@id="super-container"]/div/div[1]/div[2]/div[2]/div[1]/table/tbody/tr[' + str(i) + ']/td[1]/span[1]/text()')
            l.add_xpath('timings2', '//*[@id="super-container"]/div/div[1]/div[2]/div[2]/div[1]/table/tbody/tr[' + str(i) + ']/td[1]/span[2]/text()')
        return l.load_item()