Question

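When I run my scraper, it fetches nothing from yell.com. As far as I can tell the XPaths are fine, and I can't work out whether I've made a mistake somewhere. Any workaround would be appreciated. This is the code I tried: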

import requests
from lxml import html

url="https://www.yell.com/ucs/UcsSearchAction.do?keywords=pizza&location=all+states&scrambleSeed=821749505"
def Startpoint(address):
    response = requests.get(address)
    tree = html.fromstring(response.text)
    titles = tree.xpath('//div[contains(@class,"col-sm-24")]')
    for title in titles:
        try:
            Name=title.xpath('.//h2[@itemprop="name"]/text()')[0]
            print(Name)
        except exception as e:
            print(e.message)
            continue
Startpoint(url)

Answer 1

You need to specify a User-Agent string that pretends to be a real browser:

response = requests.get(address, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36'})
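
As a minimal sketch of the same idea, the header can be set once on a requests.Session so that every request made through it carries the browser-like value. The helper name make_session and the constant BROWSER_UA are assumptions made for illustration. The request is only prepared here, not sent, so the outgoing header can be inspected without touching the network.

```python
import requests

# The User-Agent value is the one from the answer; the names BROWSER_UA and
# make_session are hypothetical and exist only for this sketch.
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36"
)


def make_session(user_agent=BROWSER_UA):
    """Return a Session whose requests all carry the given User-Agent."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


session = make_session()

# Build the request without sending it, to see which header would go out.
prepared = session.prepare_request(requests.Request("GET", "https://www.yell.com/"))
print(prepared.headers["User-Agent"])

# For comparison: the identifier requests sends when no header is given.
print(requests.utils.default_user_agent())
```

Calling session.get(address) afterwards would send the same header; whether a given site accepts it is up to that site.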

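A few other notes:

Exception starts with a capital letter. Python names are case-sensitive, so the lowercase exception in your except clause is an undefined name.
You should not use the col-sm-24 class in your locator. Bootstrap classes like this one are layout-specific and say nothing about what kind of data the container holds. Use the businessCapsule class instead: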
```
titles = tree.xpath("//div[contains(concat(' ', @class, ' '), ' businessCapsule ')]")
```
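Note how the class attribute is checked properly here: padding @class with spaces and searching for ' businessCapsule ' matches the class as a whole token rather than as a substring of some other class name.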
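
The difference between the two class checks can be seen on a small hand-written snippet. This is only a sketch: the markup and the class name businessCapsule--title are invented for illustration and are not taken from yell.com.

```python
from lxml import html

# Invented markup: one real "businessCapsule" container and one element whose
# class merely starts with the same letters (hypothetical class name).
snippet = """
<div>
  <div class="row businessCapsule">A</div>
  <div class="businessCapsule--title">B</div>
</div>
"""
tree = html.fromstring(snippet)

# Substring check: matches any class attribute that contains the text.
loose = [d.text for d in tree.xpath('//div[contains(@class, "businessCapsule")]')]

# Token check: pad the attribute with spaces and look for the padded name.
strict = [
    d.text
    for d in tree.xpath(
        "//div[contains(concat(' ', @class, ' '), ' businessCapsule ')]"
    )
]

print(loose)   # ['A', 'B']
print(strict)  # ['A']
```

The token check assumes class names are separated by plain spaces, which is the usual case in served HTML.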

You can use the findtext() method to look up the result titles:

results = tree.xpath("//div[contains(concat(' ', @class, ' '), ' businessCapsule ')]")

for result in results:
    name = result.findtext('.//h2[@itemprop="name"]')
    print(name)
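
One practical consequence, sketched below on invented sample markup: findtext() returns None when no matching element exists, whereas indexing xpath(...)[0] raises IndexError, so the loop needs no try/except. The function name extract_names and the SAMPLE string are assumptions made for this example.

```python
from lxml import html

# Invented sample: the second capsule deliberately has no <h2> heading.
SAMPLE = """
<div>
  <div class="businessCapsule"><h2 itemprop="name">Pizza One</h2></div>
  <div class="businessCapsule"><p>no heading here</p></div>
</div>
"""


def extract_names(page_source):
    """Collect the business names found in the given HTML source."""
    tree = html.fromstring(page_source)
    results = tree.xpath(
        "//div[contains(concat(' ', @class, ' '), ' businessCapsule ')]"
    )
    names = []
    for result in results:
        # findtext() gives None for a missing element instead of raising.
        name = result.findtext('.//h2[@itemprop="name"]')
        if name is not None:
            names.append(name.strip())
    return names


print(extract_names(SAMPLE))  # ['Pizza One']
```

With a live page, the same function could be called on response.text from the request shown earlier in the answer.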
