I am trying to scrape Google News with the following code:
from bs4 import BeautifulSoup
import requests
import time
from random import randint

def scrape_news_summaries(s):
    time.sleep(randint(0, 2))  # relax and don't let google be angry
    r = requests.get("http://www.google.co.uk/search?q=" + s + "&tbm=nws")
    content = r.text
    news_summaries = []
    soup = BeautifulSoup(content, "html.parser")
    st_divs = soup.findAll("div", {"class": "st"})
    for st_div in st_divs:
        news_summaries.append(st_div.text)
    return news_summaries

l = scrape_news_summaries("T-Notes")
#l = scrape_news_summaries("""T-Notes""")
for n in l:
    print(n)
Even though this code worked before, I can't figure out why it no longer does. Is it possible that Google banned me, even though I only ran the code 3 or 4 times? (My attempt with Bing News also unfortunately returned empty results...)

Thanks.
Answer 0 (score: 2)
I tried running the code and it works fine on my machine.

You can try printing the status code of the request to see whether it is something other than 200.
from bs4 import BeautifulSoup
import requests
import time
from random import randint

def scrape_news_summaries(s):
    time.sleep(randint(0, 2))  # relax and don't let google be angry
    r = requests.get("http://www.google.co.uk/search?q=" + s + "&tbm=nws")
    print(r.status_code)  # Print the status code
    content = r.text
    news_summaries = []
    soup = BeautifulSoup(content, "html.parser")
    st_divs = soup.findAll("div", {"class": "st"})
    for st_div in st_divs:
        news_summaries.append(st_div.text)
    return news_summaries

l = scrape_news_summaries("T-Notes")
#l = scrape_news_summaries("""T-Notes""")
for n in l:
    print(n)
See https://www.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/ for a list of status codes that indicate you have been banned.
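As a side note, two common causes of empty results with this kind of scraper are an unencoded query string and the default python-requests User-Agent, which Google may answer with a CAPTCHA page (still status 200) or a 429/503. A minimal sketch of one way to address both, letting requests handle URL encoding via `params` and sending a browser-like header (the helper name `build_news_request` is my own, not from the original post):

```python
import requests

def build_news_request(query):
    # Hypothetical helper: let requests percent-encode the query via
    # `params` instead of string concatenation, and send a minimal
    # browser-like User-Agent instead of the default "python-requests/x.y".
    req = requests.Request(
        "GET",
        "http://www.google.co.uk/search",
        params={"q": query, "tbm": "nws"},
        headers={"User-Agent": "Mozilla/5.0"},
    )
    return req.prepare()

prepared = build_news_request("T-Notes")
print(prepared.url)  # query string is safely percent-encoded
```

The prepared request can then be sent with `requests.Session().send(prepared)`; checking `r.status_code` afterwards still applies as described above.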