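I have been working on this problem for the past 10 hours and still cannot solve it. The code works for some people, but it does not work for me.
The main goal is to scrape the search results for https://www.google.com.au/webhp?num=100&gl=au&hl=en#q=site:focusonfurniture.com.au&gl=au&hl=en&start=0
Here is my code: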
# -*- coding: utf-8
from bs4 import BeautifulSoup
import urllib, urllib2

def google_scrape(query):
    address = "https://www.google.com.au/webhp?num=100&gl=au&hl=en#q=site:focusonfurniture.com.au&gl=au&hl=en&start=0".format(urllib.quote_plus(query))
    request = urllib2.Request(address, None, {'User-Agent':'Mozilla/43.0.1'})
    urlfile = urllib2.urlopen(request)
    html = urlfile.read()
    soup = BeautifulSoup(html)
    linkdictionary = {}
    for li in soup.findAll('div', attrs={'class' : 'g'}):  # Execution never enters this loop because findAll returns nothing
        sLink = li.find('.r a')
        print sLink['href']
    return linkdictionary

if __name__ == '__main__':
    links = google_scrape('beautifulsoup')
    print links
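The output I get is {}. The call soup.findAll('div', attrs={'class' : 'g'}) returns nothing, so I cannot get any results.
I am using BS4 and Python 2.7. Please help me understand why the code does not work. Any help would be greatly appreciated.
Also, it would be great if someone could shed some light on why the same code works for some people and not for others (this happened to me last time as well). Thanks.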
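One thing worth checking, as a likely cause rather than a confirmed diagnosis: in the address above the search terms sit after the `#`. That part of a URL is the fragment, which is handled on the client side and is not sent to the server, so the response to a plain HTTP request cannot depend on it. In addition, the string has no `{}` placeholder, so `.format(...)` leaves it unchanged and the `query` argument is dropped. The sketch below shows both points with the standard library only. `build_search_url` is a hypothetical helper, and the `/search` path with a `q` parameter is an assumption about where the query would go; whether Google then returns markup containing `div.g` to a script is a separate matter.

```python
try:
    from urllib.parse import urlsplit, parse_qs, quote_plus, urlencode  # Python 3
except ImportError:
    from urlparse import urlsplit, parse_qs      # Python 2.7
    from urllib import quote_plus, urlencode     # Python 2.7

URL = ("https://www.google.com.au/webhp?num=100&gl=au&hl=en"
       "#q=site:focusonfurniture.com.au&gl=au&hl=en&start=0")

parts = urlsplit(URL)

# Parameters in the query string: this is what the server can act on.
server_params = parse_qs(parts.query)

# Parameters in the fragment: these stay on the client.
# Here they include the actual search terms.
client_only = parse_qs(parts.fragment)

# str.format() with no {} placeholder returns the string unchanged,
# so the quoted query never makes it into the address.
formatted = URL.format(quote_plus("beautifulsoup"))


def build_search_url(query, num=100, start=0):
    """Hypothetical helper: put every parameter in the query string."""
    params = [("q", query), ("num", num), ("gl", "au"),
              ("hl", "en"), ("start", start)]
    return "https://www.google.com.au/search?" + urlencode(params)


print(sorted(server_params))   # ['gl', 'hl', 'num']
print(client_only.get("q"))    # ['site:focusonfurniture.com.au']
print(formatted == URL)        # True
print(build_search_url("site:focusonfurniture.com.au"))
```

If this is the cause, it may also explain why the "same" code behaves differently for different people: a version that passes the query in the query string gets a different response from one that passes it in the fragment.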
Answer 0 (score: 0)
Here is an example of what you can do. You need selenium and PhantomJS (which can emulate a browser):
import selenium.webdriver
from pprint import pprint
import re

url = 'https://www.google.com.au/webhp?num=100&gl=au&hl=en#q=site:focusonfurniture.com.au&gl=au&hl=en&start=0'
driver = selenium.webdriver.PhantomJS()
driver.get(url)
html = driver.page_source
regex = r"<cite>(https:\/\/www\.focusonfurniture\.com\.au\/[\/A-Z]+)<\/cite>"
result = re.findall(re.compile(regex, re.IGNORECASE | re.MULTILINE),html)
for url in result:
    print url
driver.quit()
Result:
https://www.focusonfurniture.com.au/delivery/
https://www.focusonfurniture.com.au/terms/
https://www.focusonfurniture.com.au/disclaimer/
https://www.focusonfurniture.com.au/dining/
https://www.focusonfurniture.com.au/bedroom/
https://www.focusonfurniture.com.au/catalogue/
https://www.focusonfurniture.com.au/mattresses/
https://www.focusonfurniture.com.au/clearance/
https://www.focusonfurniture.com.au/careers/
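Note that all nine URLs above have paths made up only of letters and slashes. That is consistent with the character class `[\/A-Z]+` in the regex, which (with `re.IGNORECASE`) accepts nothing else, so a path containing a digit or a hyphen would be skipped without any error. The sketch below runs the same pattern against a small piece of markup. `SAMPLE_HTML` is made up for illustration and is not real Google output, and the looser `LOOSE` pattern is only one possible alternative.

```python
import re

# Pattern from the answer, unchanged.
PATTERN = re.compile(
    r"<cite>(https:\/\/www\.focusonfurniture\.com\.au\/[\/A-Z]+)<\/cite>",
    re.IGNORECASE | re.MULTILINE,
)

# Hypothetical markup for illustration only.
SAMPLE_HTML = (
    "<cite>https://www.focusonfurniture.com.au/delivery/</cite>"
    "<cite>https://www.focusonfurniture.com.au/sofa-beds/</cite>"
    "<cite>https://www.focusonfurniture.com.au/page2/</cite>"
)

# Only the letters-and-slashes path survives.
strict = PATTERN.findall(SAMPLE_HTML)

# Looser variant (an assumption, not the answer's code):
# accept any path characters up to the next '<' or whitespace.
LOOSE = re.compile(
    r"<cite>(https://www\.focusonfurniture\.com\.au/[^<\s]*)</cite>",
    re.IGNORECASE,
)
loose = LOOSE.findall(SAMPLE_HTML)

print(strict)  # ['https://www.focusonfurniture.com.au/delivery/']
print(len(loose))  # 3
```

Matching on `<cite>` also depends on the markup Google happens to serve, so either pattern may need adjusting if that markup changes.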