Question

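I'm pulling about 100,00 values out of a spreadsheet and then grabbing the first search result for each one to see whether it is http or https. The script works fine (well enough for my purposes), but I get a 503 error after the 70th iteration of the loop.

Any thoughts/ideas/suggestions on how to get through the number of queries I need?

Code: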

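import re
import time

import pandas as pd
from google import search  # third-party "google" package, imported the same way as in Answer 1

library_list = pd.read_csv("PLS_FY2014_AE_pupld14a.csv")

zero = 0        # index of the library name currently being searched
with_https = 0  # number of first results whose URL contains "https"

for i in library_list['LIBNAME']:
    # Request only the first search result for the current library name
    for url in search(library_list['LIBNAME'][zero], num = 1, start = 0, stop = 1):
        time.sleep(5)  # wait 5 seconds between queries
        zero += 1
        print(zero)
        if 'https' in url:
            with_https += 1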
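
A side note on the check itself: `'https' in url` matches that substring anywhere in the URL, so a plain-http link whose path happens to contain "https" would be counted too. Below is a minimal sketch of a stricter check, assuming Python 3 and the standard library's `urllib.parse.urlparse`. The helper name `is_https` and the sample URLs are made up for illustration and are not part of the original script.

```python
from urllib.parse import urlparse


def is_https(url):
    """Return True only when the URL's scheme component is https."""
    # Compare the parsed scheme instead of searching the whole string
    return urlparse(url).scheme == "https"


# Hypothetical sample URLs, for illustration only
samples = [
    "https://example.org/library",
    "http://example.org/why-https-matters",
]
with_https = sum(is_https(u) for u in samples)
print(with_https)
```

In the loop above, this would amount to replacing the `'https' in url` test with `is_https(url)`.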

Answer 1

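I'm trying to do the same thing, and I was getting 503 errors after 30-50 results. I ended up forcing the script to wait a random time between 30 and 60 seconds before each search. I've read that other people have the same problem, and they say Google limits bot searches to around 50 per hour. The code I'm using is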

import random
import time

import arcpy

try:
    from google import search
except ImportError:
    print("No module named 'google' found")
    raise  # nothing below can run without the search function

# `facilities` is the feature class being updated; it is defined elsewhere in the full script
with arcpy.da.UpdateCursor(facilities, ["NAME", "Weblinks", "ADDRESSSTATECODE", "MP_TYPE"]) as rows:
    for row in rows:
        # Only search for rows that do not have a web link yet
        if row[1] is None:
            if row[3] != "xxxxxx":
                query = str(row[0])
                print("The query will be " + query)
                # Random pause of 30 to 60 seconds, passed to search() below
                wt = random.uniform(30,60)
                print("Script will wait " + str(wt) + " seconds before the next search.")
                for j in search("recreation.gov " + query + ", " + str(row[2]), tld="co.in", num=1, stop=1, pause=wt):
                    # Store the first result in the Weblinks field
                    row[1] = str(j)
                    rows.updateRow(row)
                    print(row[1])
                    time.sleep(5)
                    print("")

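My script has now been running non-stop for 7 days with no more errors. It may be slow, but it gets the job done eventually. In this round I'm doing about 18,000 searches.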
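
The random-wait idea above can be separated from arcpy and from the search call itself. The following is only a sketch of that pattern, not the code from the answer. The name `throttled_map`, its parameters, and the choice to double the wait after each failure are assumptions added for illustration. `fetch` stands for any function that performs one search and raises on an error such as a 503.

```python
import random
import time


def throttled_map(items, fetch, min_wait=30.0, max_wait=60.0,
                  max_retries=3, sleep=time.sleep):
    """Call fetch(item) for each item with a random pause between items.

    If fetch raises, wait progressively longer and retry, up to
    max_retries times. Items that still fail are recorded as None
    so the rest of the run can continue.
    """
    results = []
    for item in items:
        for attempt in range(max_retries + 1):
            try:
                results.append((item, fetch(item)))
                break
            except Exception:  # e.g. an HTTP 503 raised by the search call
                if attempt == max_retries:
                    results.append((item, None))
                    break
                # Back off: the wait doubles with every failed attempt
                sleep(max_wait * 2 ** (attempt + 1))
        # Random pause before the next item (30 to 60 seconds by default)
        sleep(random.uniform(min_wait, max_wait))
    return results


# Hypothetical usage with the `search` function from the code above:
# first_url = lambda name: next(iter(search(name, num=1, stop=1)), None)
# urls = throttled_map(library_list["LIBNAME"], first_url)
```

At an average of about 45 seconds per query, 18,000 searches come to roughly 18,000 × 45 s = 810,000 s, or a little over 9 days, so a run lasting a week or more is to be expected with waits of this size.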

Query limit using Python's Google module
