Names from a web page are not being scraped

Date: 2017-05-03 06:05:38

Tags: python web-scraping web-crawler

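When I run my scraper, it fetches nothing from yell.com. As far as I can tell, the XPaths are fine, and I can't work out whether I have made a mistake somewhere. Any workaround would be appreciated. Here is the code I tried: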

import requests
from lxml import html

url="https://www.yell.com/ucs/UcsSearchAction.do?keywords=pizza&location=all+states&scrambleSeed=821749505"
def Startpoint(address):
    response = requests.get(address)
    tree = html.fromstring(response.text)
    titles = tree.xpath('//div[contains(@class,"col-sm-24")]')
    for title in titles:
        try:
            Name=title.xpath('.//h2[@itemprop="name"]/text()')[0]
            print(Name)
        except exception as e:
            print(e.message)
            continue
Startpoint(url)

1 answer:

Answer 0 (score: 1)

You need to send a User-Agent string that makes the request look like it comes from a real browser:

response = requests.get(address, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36'})
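To see what changes when the header is supplied, here is a minimal offline sketch. It assumes you want the same User-Agent on every request, so it sets the header once on a requests.Session and inspects the prepared request without contacting the site. The names BROWSER_UA and build_session are hypothetical, introduced only for illustration.

```python
import requests

# Browser-like User-Agent copied from the answer above.
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36"
)


def build_session(user_agent=BROWSER_UA):
    """Hypothetical helper: a Session that sends user_agent on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


session = build_session()

# prepare_request() merges the session headers into the request
# without sending anything over the network.
prepared = session.prepare_request(
    requests.Request("GET", "https://www.yell.com/")
)

# What requests would send if no User-Agent were set: "python-requests/<version>"
print(requests.utils.default_user_agent())
# What this session sends instead.
print(prepared.headers["User-Agent"])
```

With the session in place, session.get(address) can replace requests.get(address, headers=...) in Startpoint. Whether the site accepts the request still depends on its own checks.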

A few other notes:

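  • Exception starts with a capital letter. Python names are case-sensitive, so "except exception" refers to an undefined name.
  • You should not use the col-sm-24 class in your locator: this kind of Bootstrap class is layout-specific and carries no information about what kind of data the container holds. Use the businessCapsule class instead: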

    titles = tree.xpath("//div[contains(concat(' ', @class, ' '), ' businessCapsule ')]")
    

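    Note that we properly check the class attribute here: padding @class with spaces makes the test match the whole class name rather than any substring of the attribute.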

  • You can use the findtext() method to get the result titles:

    results = tree.xpath("//div[contains(concat(' ', @class, ' '), ' businessCapsule ')]")
    
    for result in results:
        name = result.findtext('.//h2[@itemprop="name"]')
        print(name)
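The two locator notes above can be tried without touching the network. The sketch below runs against a small HTML sample invented for illustration (it is not the real yell.com markup, and the businessCapsule--footer class is hypothetical). It compares a plain substring check on @class with the space-padded check from the answer, and skips containers where findtext() finds no heading.

```python
from lxml import html

# Invented sample markup; only the businessCapsule class and the
# itemprop="name" heading come from the answer above.
SAMPLE = """
<html><body>
  <div class="row businessCapsule col-sm-24"><h2 itemprop="name">Pizza One</h2></div>
  <div class="businessCapsule--footer col-sm-24"><h2 itemprop="name">Not a listing</h2></div>
  <div class="businessCapsule"><span>No heading here</span></div>
</body></html>
"""

# Substring check: also matches classes that merely contain the text.
NAIVE_XPATH = '//div[contains(@class, "businessCapsule")]'
# Space-padded check from the answer: matches the exact class name only.
TOKEN_XPATH = "//div[contains(concat(' ', @class, ' '), ' businessCapsule ')]"


def count_matches(markup, xpath):
    """Return how many elements the XPath expression selects."""
    return len(html.fromstring(markup).xpath(xpath))


def extract_names(markup):
    """Collect heading text from each businessCapsule container."""
    tree = html.fromstring(markup)
    names = []
    for result in tree.xpath(TOKEN_XPATH):
        # findtext() returns None when no matching h2 exists,
        # so no try/except around an index lookup is needed.
        name = result.findtext('.//h2[@itemprop="name"]')
        if name is not None:
            names.append(name.strip())
    return names


print(count_matches(SAMPLE, NAIVE_XPATH))
print(count_matches(SAMPLE, TOKEN_XPATH))
print(extract_names(SAMPLE))
```

The None check stands in for the try/except in the question: a container without a heading is skipped instead of raising IndexError.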