Question

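I have been working on this problem for the past 10 hours and still cannot solve it. The code works for some people, but it does not work for me.

The main goal is to extract the Google result URLs from every results page for this search:

https://www.google.com.au/webhp?num=100&gl=au&hl=en#q=site:focusonfurniture.com.au&gl=au&hl=en&start=0

Here is my code: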

# -*- coding: utf-8
from bs4 import BeautifulSoup
import urllib, urllib2

def google_scrape(query):
    address = "https://www.google.com.au/webhp?num=100&gl=au&hl=en#q=site:focusonfurniture.com.au&gl=au&hl=en&start=0".format (urllib.quote_plus(query))
    request = urllib2.Request(address, None, {'User-Agent':'Mozilla/43.0.1'})
    urlfile = urllib2.urlopen(request)
    html = urlfile.read()
    soup = BeautifulSoup(html)
    linkdictionary = {}

    for li in soup.findAll('div', attrs={'class' : 'g'}): # Execution never enters this loop because findAll returns nothing

        sLink = li.find('.r a')
        print sLink['href']

    return linkdictionary

if __name__ == '__main__':
    links = google_scrape('beautifulsoup')
    print links

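The output is {}. The call soup.findAll('div', attrs={'class' : 'g'}) returns nothing, so I cannot get any results.

I am using BS4 and Python 2.7. Please help me understand why the code does not work. Any help would be much appreciated.

Also, it would be great if someone could explain why the same code works for some people but not for others (this has happened to me before). Thanks.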

Answer 1

Here is an example of what you can do. You need selenium and phantomjs (which emulates a browser):

import re

import selenium.webdriver

url = 'https://www.google.com.au/webhp?num=100&gl=au&hl=en#q=site:focusonfurniture.com.au&gl=au&hl=en&start=0'

# PhantomJS is a headless browser, so the page's JavaScript runs before we read the HTML
driver = selenium.webdriver.PhantomJS()
driver.get(url)
html = driver.page_source

# Capture the URL shown inside the <cite> element of each result
regex = r"<cite>(https://www\.focusonfurniture\.com\.au/[/A-Z]+)</cite>"

# Use a separate loop variable so the search URL above is not overwritten
for link in re.findall(regex, html, re.IGNORECASE):
    print link

driver.quit()

Result:

https://www.focusonfurniture.com.au/delivery/
https://www.focusonfurniture.com.au/terms/
https://www.focusonfurniture.com.au/disclaimer/
https://www.focusonfurniture.com.au/dining/
https://www.focusonfurniture.com.au/bedroom/
https://www.focusonfurniture.com.au/catalogue/
https://www.focusonfurniture.com.au/mattresses/
https://www.focusonfurniture.com.au/clearance/
https://www.focusonfurniture.com.au/careers/
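
As for why the urllib2 version in the question prints {}: in that URL the search terms come after the #, which makes them a URL fragment. HTTP clients do not send the fragment to the server, so urllib2 only requests /webhp?num=100&gl=au&hl=en, a page without any results in its HTML. A browser such as PhantomJS runs the page's JavaScript, which is why the selenium approach sees them. Note also that linkdictionary is never filled, so {} would be returned even if the loop did run. The sketch below only illustrates the fragment point using the standard library. It is written for Python 3 (under 2.7 the same functions live in urlparse and urllib), the helper names server_visible_part and build_search_url are made up for this example, and the /search path is an assumption that does not appear in the original post. Google may still block scripted requests or return different markup, so treat it as a starting point rather than a fix.

```python
from urllib.parse import urlsplit, urlencode

ORIGINAL_URL = ("https://www.google.com.au/webhp?num=100&gl=au&hl=en"
                "#q=site:focusonfurniture.com.au&gl=au&hl=en&start=0")


def server_visible_part(url):
    """Return the URL without its fragment, i.e. what is actually requested."""
    parts = urlsplit(url)
    # Everything after '#' stays on the client and never reaches the server.
    return parts._replace(fragment="").geturl()


def build_search_url(query, num=100, start=0):
    """Hypothetical helper: put the search terms in the query string instead.

    The /search path is an assumption; it is not taken from the original post.
    """
    params = [("q", query), ("num", num), ("gl", "au"), ("hl", "en"), ("start", start)]
    return "https://www.google.com.au/search?" + urlencode(params)


if __name__ == "__main__":
    # The search terms sit in the fragment of the original URL...
    print(urlsplit(ORIGINAL_URL).fragment)
    # ...so this is all that urllib2 asked the server for.
    print(server_visible_part(ORIGINAL_URL))
    # A URL whose search terms would be sent to the server.
    print(build_search_url("site:focusonfurniture.com.au"))
```

With a URL built this way, the start parameter can be increased to step through further result pages, assuming the server accepts the request at all.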
