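I'm a student collecting information about brands. I found a site called Kit: Kit Page that I want to scrape for brand names. It has nearly 500 pages, and I wrote a Scrapy spider in Python 3 that visits each page and is meant to copy the listings into a dictionary, but I can't work out the XPath or CSS selector that actually retrieves the listing data. Here is my items.py: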
import scrapy

class KitcreatorwebscraperItem(scrapy.Item):
    creator = scrapy.Field()
Here is my spider:
import scrapy

class KitCreatorSpider(scrapy.Spider):
    name = "kitCreators"
    pageNumber = 1
    start_urls = [
        'https://kit.com/brands?page=1',
    ]
    # Queue every listing page from 1 to 478.
    while pageNumber <= 478:
        newUrl = "https://kit.com/brands?page=" + str(pageNumber)
        start_urls.append(newUrl)
        pageNumber += 1

    def parse(self, response):
        # This selector is the part I can't get right.
        for li in response.xpath('//div[@class="section group"][0]'):
            pass  # TODO: extract the listing data here
It runs without errors, but I can't write an XPath that gets the data I need. What path do I need, and how do I use it in my code?
Answer 0 (score: 0)
You can try the XPath below to extract the brand names:
//a[@class="brandsView-list-item-link ng-binding"]/text()
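To see what this XPath selects, here is a minimal sketch that runs it against a made-up HTML fragment using only the standard library. The markup in SAMPLE is an assumption for illustration, not copied from kit.com, and extract_brand_names is a hypothetical helper. Two things worth noting: @class="..." is an exact string comparison, so the attribute must contain both class names in this order, and ElementTree supports only a subset of XPath, so the text is read through .text rather than /text().

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the real kit.com markup may differ.
SAMPLE = """
<ul>
  <li><a class="brandsView-list-item-link ng-binding" href="/brands/a">Brand A</a></li>
  <li><a class="brandsView-list-item-link ng-binding" href="/brands/b">Brand B</a></li>
  <li><a class="other-link" href="/about">About</a></li>
</ul>
"""

def extract_brand_names(markup):
    """Return the text of every link whose class attribute matches exactly."""
    root = ET.fromstring(markup)
    links = root.findall('.//a[@class="brandsView-list-item-link ng-binding"]')
    return [link.text.strip() for link in links]

brand_names = extract_brand_names(SAMPLE)
print(brand_names)  # ['Brand A', 'Brand B']
```

In a Scrapy callback the equivalent call would be response.xpath(...).getall() with the full expression including /text(), provided the links are present in the HTML that Scrapy downloads.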
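P.S. I'd advise against building a list of URLs up front; it looks like redundant code. Instead, you can use a for loop:
for page_number in range(1, 479):  # pages 1 through 478, as in the question
    url = "https://kit.com/brands?page=%s" % page_number
    # ... handle current page source ...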
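One way to package that loop is a small generator that produces the page URLs on demand instead of storing them. This is only a sketch: build_page_urls and BASE_URL are hypothetical names, and the page range 1 to 478 is taken from the question rather than verified against the site.

```python
# URL pattern from the question; %s is replaced by the page number.
BASE_URL = "https://kit.com/brands?page=%s"

def build_page_urls(first_page, last_page):
    """Yield listing URLs for first_page through last_page, inclusive."""
    for page_number in range(first_page, last_page + 1):
        yield BASE_URL % page_number

urls = list(build_page_urls(1, 478))
print(len(urls))   # 478
print(urls[0])     # https://kit.com/brands?page=1
print(urls[-1])    # https://kit.com/brands?page=478
```

Inside a Scrapy spider, the same generator could be consumed in start_requests, yielding scrapy.Request(url, callback=self.parse) for each URL, which avoids filling start_urls in the class body.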
Update
You can try Selenium + PhantomJS to get the required data from the dynamically rendered content:
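from selenium import webdriver

# Written for Selenium 3.x; webdriver.PhantomJS and find_elements_by_xpath
# are no longer available in current Selenium 4 releases.
driver = webdriver.PhantomJS()
brands_list = []
for page in range(1, 479):  # pages 1 through 478, as in the question
    driver.get("https://kit.com/brands?page=%s" % page)
    links = driver.find_elements_by_xpath('//a[@class="brandsView-list-item-link ng-binding"]')
    brands_list.extend(link.text for link in links)
driver.quit()  # shut down the browser process
print(brands_list)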