503 error when scraping Google Search with Python 3 - requests, BeautifulSoup4

Time: 2017-10-04 12:29:54

Tags: python web-scraping beautifulsoup http-status-code-503

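I want to scrape only the link titles from about 20 pages of Google Search results. I tried this code the day before and it worked! But today it returns a 503 error.

I searched for ways to fix this. Here is what I tried.

  • Adding a delay (inserting 'time.sleep(60)' after 25).
  • The 'fake user agent' library.

But I still get the 503 error. Here is the file.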

import requests
from bs4 import BeautifulSoup
from collections import Counter

#google, '소프트웨어 교육'
base_google1_url = "https://www.google.co.kr/search?q=%EC%86%8C%ED%94%84%ED%8A%B8%EC%9B%A8%EC%96%B4+%EA%B5%90%EC%9C%A1&safe=active&ei=rv_RWYyaKcmW0gTqsa_IDg&start="
extra_google1_url="&sa=N&biw=958&bih=954"
#google, 'sw교육'
base_google2_url="https://www.google.co.kr/search?q=sw%EA%B5%90%EC%9C%A1&safe=active&ei=kLzUWYONLYa30QS4r5KACA&start="
extra_google2_url="&sa=N&biw=887&bih=950"

#book.naver, '소프트웨어 교육'
base_naver_url = "http://book.naver.com/search/search_in.nhn?query=%EC%86%8C%ED%94%84%ED%8A%B8%EC%9B%A8%EC%96%B4+%EA%B5%90%EC%9C%A1&&pattern=0&orderType=rel.desc&viewType=list&searchType=bookSearch&serviceSm=service.basic&title=&author=&publisher=&isbn=&toc=&subject=&publishStartDay=&publishEndDay=&categoryId=&qdt=1&filterType=0&filterValue=&serviceIc=service.author&buyAllow=0&ebook=0&page="

#from: https://docs.python.org/2/library/collections.html
cnt = Counter()


#bring search info
def get_html (site_name, content_num):
    _html = ""
    if site_name == 'google1':
        google1_url = base_google1_url + str(content_num) + extra_google1_url
        resp = requests.get(google1_url)
    elif site_name == 'google2':
        google2_url = base_google2_url + str(content_num) + extra_google2_url
        resp = requests.get(google2_url)
    elif site_name == 'naver':
        naver_url = base_naver_url + str(content_num)
        resp = requests.get(naver_url)

    if resp.status_code == 200:
        _html = resp.text
    return _html

def word_count (name):
    for content in name.contents:
        words = content.split()
        for word in words:
            cnt[word] += 1
    counting = cnt
    return counting



def main():

    cnt.clear()
    counting = cnt
    page_num = 0

    #bring google '소프트웨어 교육' search info~~
    while page_num < 20:
        content_num = page_num*10
        html = get_html("google1", content_num)
        soup = BeautifulSoup(html, 'html.parser')
        texts = soup.find_all('h3')
        invalid_tag = ['b']
        for text in texts:
            for match in text.find_all(invalid_tag):
                match.replaceWithChildren()
            names = text.find_all('a')
            for name in names:
                counting = word_count(name)
        page_num += 1

    page_num = 0
    #bring google 'sw교육' search info~~
    while page_num < 20:
        content_num = page_num*10
        html = get_html("google2", content_num)
        soup = BeautifulSoup(html, 'html.parser')
        texts = soup.find_all('h3')
        invalid_tag = ['b', 'a']
        for text in texts:
            for match in text.find_all(invalid_tag):
                match.replaceWithChildren()
            counting = word_count(text)
            print(text)
        page_num += 1

    #bring naver book search info~~
    page_num = 1
    while page_num < 40:
        html = get_html("naver", page_num)
        soup = BeautifulSoup(html, 'html.parser')
        texts = soup.find_all("dt")
        invalid_tag = ['a','strong', 'span', 'img']
        for text in texts:
            for match in text.find_all(invalid_tag):
                match.replaceWithChildren()
            counting = word_count(text)
        page_num += 1

    #deleting useless keywords: if need to include len(k) == 1, instead of 'len(k) == 1 and ~ ' use following code --'or (len(k) == 1 and ord(k) >=33 and ord(k)<65)'
    #https://stackoverflow.com/questions/8448202/remove-more-than-one-key-from-python-dict
    del counting['소프트웨어'], counting['교육']
    for key in [k for k in counting if len(k) == 1 or type(k) == int]: del counting[key]

    count_20 = counting.most_common(20)
    print(count_20)




if __name__ == '__main__':
    main()

Please help me! Thank you in advance.

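1 answer:

Answer 0: (score: 0)

You may be overthinking this. I think there is too much code for a task like this.

  1. Add headers manually, with a User-Agent (taken from a list of user agents).
  2. Use a list of links. I don't think separate variables are needed.

You can do it like this (example in the online IDE):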

import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

links = [
    'https://www.google.com/search?q=chuck norris',
    'https://www.google.com/search?q=minecraft fandom',
    'https://www.google.com/search?q=fus ro dah'
]

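for url in links:
  # requests percent-encodes the spaces in the query string
  html = requests.get(url, headers=headers).text
  soup = BeautifulSoup(html, 'lxml')

  # '.DKV0Md' was the CSS class of the result titles when this was written; it may change
  for title in soup.select('.DKV0Md'):
    print(title.text)
  # blank line to separate the results of each query
  print()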

Output:

Chuck Norris - Wikipedia
Chuck Norris: Home
Chuck Norris - IMDb
Chuck Norris | Facebook
Chuck Norris (@chucknorris) | Twitter
Chuck Norris - Age, Facts & Movies - Biography
101 Best Chuck Norris Jokes - Chuck Norris Facts - Parade
Chuck Norris, Famous Veteran | Military.com
These Chuck Norris Facts Will Make You Love Him Even More ...

Official Minecraft Wiki – The Ultimate Resource for Minecraft
Official Minecraft Wiki - Minecraft Wiki - Fandom
the minecraft fandom shut down : Minecraft - Reddit
900+ Minecraft Fandom ideas in 2021 | dream team, my ...
14 Minecraft Fandom ideas | minecraft fan art, dream team, my ...
Minecraft Fandom - Minecraft Wiki Guide - IGN

Unrelenting Force (Skyrim) | Elder Scrolls | Fandom
Fus Ro Dah | Know Your Meme
Fus ro dah - Urban Dictionary
Skyrim:Unrelenting Force - The Unofficial Elder Scrolls Pages ...
Fus | Thuum.org - The Dragon Language Dictionary
60 “Fus ro dah!” (The Elder Scrolls V: Skyrim) ideas | skyrim ...
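
Since the question involves about 20 result pages and a 503 response, here is a minimal sketch of paging through results and backing off when a 503 comes back. The helper names build_search_url and fetch_with_backoff, the 30-second base delay, and the use of the start parameter as a result offset (taken from the URLs in the question) are assumptions for illustration. Waiting and retrying does not guarantee that Google will stop returning 503.

```python
import time
from urllib.parse import urlencode


def build_search_url(query, start=0, base="https://www.google.com/search"):
    """Return a search URL with the query and result offset URL-encoded."""
    return base + "?" + urlencode({"q": query, "start": start})


def fetch_with_backoff(fetch, url, max_tries=4, base_delay=30.0, sleep=time.sleep):
    """Call fetch(url) until it returns a non-503 response or tries run out.

    `fetch` is any callable returning an object with `status_code` and
    `headers` attributes, such as a requests response (hypothetical helper).
    """
    resp = None
    for attempt in range(max_tries):
        resp = fetch(url)
        if resp.status_code != 503:
            return resp
        if attempt == max_tries - 1:
            break
        # Prefer the server's Retry-After hint when it is a number of seconds;
        # otherwise double the wait after every failed attempt.
        hint = resp.headers.get("Retry-After", "")
        delay = float(hint) if hint.isdigit() else base_delay * 2 ** attempt
        sleep(delay)
    return resp
```

With the headers dict from the code above, a call such as fetch_with_backoff(lambda u: requests.get(u, headers=headers, timeout=10), build_search_url('chuck norris', start=10)) would request the second page of results. The caller should still check resp.status_code, because the last 503 response is returned if every try fails.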

Alternatively, you can use the Google Search Engine Results API from SerpApi. It is a paid API with a free trial of 5,000 searches.

Code to integrate:

from serpapi import GoogleSearch

# search queries (not URLs)
queries = [
  'fus ro dah',
  'minecraft lets play',
  'gordon ramsay memes',
]

for query in queries:
  params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google",
    "q": query,
    "google_domain": "google.com",
  }

  search = GoogleSearch(params)
  results = search.get_dict()

  for result in results['organic_results']:
    title = result['title']
    print(title)
  print()

Output:

Unrelenting Force (Skyrim) | Elder Scrolls | Fandom
Fus Ro Dah | Know Your Meme
Fus ro dah - Urban Dictionary
Skyrim:Unrelenting Force - The Unofficial Elder Scrolls Pages ...
Fus | Thuum.org - The Dragon Language Dictionary
60 “Fus ro dah!” (The Elder Scrolls V: Skyrim) ideas | skyrim ...

The Fun Begins! | Let's Play Minecraft Survival Episode 1 ...
Beginning a NEW Minecraft Adventure! | Let's Play Minecraft ...
Minecraft: A New Beginning - 1.16 Survival Let's play | Ep 1 ...
An Epic New Minecraft Adventure - 1.16 Survival Let's Play ...
STARTING A NEW WORLD! - 1.16.2 Lets Play)
A New Start in Minecraft 1.16.5 (Survival Let's Play) Episode 1 ...
minecraft lets plays be like - YouTube
A NEW MINECRAFT JOURNEY!!! - Minecraft 1.16 Survival ...
Let's Play Minecraft 1.16 - Getting Started on a New World ...
Let's Play Minecraft Episode 1 - YouTube

These 29 Memes Of Gordon Ramsay Insulting People Are Too ...
51 Best Gordan Ramsey Meme ideas | ramsey, gordon ...
70 Gordon Ramsay Memes! ideas - Pinterest
50+ Iconic Gordon Ramsay Memes, Quotes, And Hilarious ...
Gordon Ramsay Memes - Pinterest
56 Gordan ramsey meme ideas | gordon ramsay funny ...
Gordon Ramsay Humor - Pinterest
The Best Chef Ramsay Memes That Capture His Endless ...
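
To get from these titles to the top-20 word count the question is after, something like the following sketch could work. It assumes the response dict has the shape used above (an 'organic_results' list whose items carry a 'title' key); extract_titles and count_title_words are hypothetical helper names, and dropping single-character tokens mirrors the filter in the question's code.

```python
from collections import Counter


def extract_titles(results):
    """Return the titles of the organic results, or [] if there are none."""
    return [r["title"] for r in results.get("organic_results", []) if "title" in r]


def count_title_words(titles, ignore=()):
    """Count whitespace-separated words, skipping single characters and ignored words."""
    counts = Counter()
    for title in titles:
        for word in title.split():
            if len(word) > 1 and word not in ignore:
                counts[word] += 1
    return counts
```

For example, count_title_words(extract_titles(results), ignore={'소프트웨어', '교육'}).most_common(20) would give the 20 most frequent words while leaving out the search terms themselves.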

Disclaimer: I work for SerpApi.