Question

Hi all!

I'm trying to make a Python web scraper that pulls all the product names from a retail website. The code for this (in PyCharm) is below:

import requests
from bs4 import BeautifulSoup

def louis_spider(max_pages):
    page = 0
    while page <= max_pages:
        url = 'https://us.testcompany.com/eng-us/women/hanbags/_/N-r4xtxc/to-' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for eachItem in soup.findAll('main', {'class': 'content'}):
            printable = eachItem.get('id')
            print(printable)
            print('Test1')
        page += 1

louis_spider(0)

So far, the code above doesn't print anything, not even "Test1". I have had some luck running it with other inputs to the .findAll() and .get() methods: .findAll('a', {'class':'skiplinks'}) with .get('href') yielded '#content Test1', and .findAll('div', {'id':'privateModeMessage'}) with .get('style') yielded 'display:none Test1'. Here is part of the "inspect element" code from the website, for your reference:

[Screenshot: a snippet of the website's code, providing context for the attempts mentioned above that worked]

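Unfortunately, my code block above doesn't yield any results! The problem seems to arise when I try to reference items inside the <main> section; I do get results when referencing the lines that come before it. Ideally, I'd be able to extract the name of each item on the webpage (see the additional snapshot of the website and its code for the specific lines). Those lines are inside the <main> section of the site's code, so I suspect my for loop is never entered there, unlike the lookups outside <main> that worked in my block above. The way I'd write this is .findAll('a', {'class': 'productName'}) with .get('class').

That said, I can't find a reason why BeautifulSoup can't access the content inside <main>. Does anyone know why this happens? Thanks in advance!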
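
One way to narrow this down is to count how many elements each lookup matches in the HTML that requests actually received. The sketch below is only illustrative: count_matches is a hypothetical helper, and sample_html is a made-up stand-in for source_code.text, built from the two lookups that worked and deliberately containing no <main> element. If the count for <main> is also 0 on the real response, the element is presumably missing from the downloaded HTML (for instance because the page adds it later with JavaScript), which is an assumption the question does not confirm.

```python
from bs4 import BeautifulSoup


def count_matches(html, tag, attrs):
    """Return how many elements in html match the given tag and attributes."""
    soup = BeautifulSoup(html, 'html.parser')
    return len(soup.find_all(tag, attrs))


# Hypothetical stand-in for source_code.text; the real page will differ.
sample_html = """
<html><body>
<a class="skiplinks" href="#content">Skip to content</a>
<div id="privateModeMessage" style="display:none"></div>
</body></html>
"""

lookups = [
    ('a', {'class': 'skiplinks'}),
    ('div', {'id': 'privateModeMessage'}),
    ('main', {'class': 'content'}),
]

# Print the match count for each lookup tried in the question.
for tag, attrs in lookups:
    print(tag, attrs, count_matches(sample_html, tag, attrs))
```

In the real script, passing plain_text instead of sample_html would show whether the for loop has anything to iterate over.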

Answer 1

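Based on the code you posted in the comments, you are getting an empty list because the XPath is wrong. The class productPrice is on a span tag, not a div.

You can get the values you want as follows: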

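from selenium.webdriver.common.by import By

# browser is the Selenium WebDriver instance from the code in the comments.
# Recent Selenium 4 releases removed find_elements_by_xpath; use find_elements(By.XPATH, ...).
namesElements = browser.find_elements(By.XPATH, "//span[@class='productPrice']")
names = [x.text for x in namesElements]
print(names)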
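
The effect of the tag mismatch can also be reproduced without a browser. Here is a minimal sketch using BeautifulSoup on a made-up HTML fragment; the markup, product names and prices are all hypothetical and only stand in for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, assumed only for illustration; the real page will differ.
sample_html = """
<main class="content">
  <a class="productName">Bag A</a><span class="productPrice">$100</span>
  <a class="productName">Bag B</a><span class="productPrice">$200</span>
</main>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# Looking for the class on the wrong tag matches nothing.
wrong = soup.find_all('div', {'class': 'productPrice'})

# Looking for it on the tag that actually carries the class finds both elements.
right = soup.find_all('span', {'class': 'productPrice'})
prices = [element.get_text() for element in right]

print(len(wrong))
print(prices)
```

The same reasoning applies to the XPath: //div[@class='productPrice'] selects nothing when the class sits on a span.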

Python BeautifulSoup4 WebCrawler .findAll() not parsing
