Understanding the parameters in google.search()

Asked: 2017-08-12 19:39:23

Tags: python web-scraping google-search

I am trying to get the first 5 URLs for each abbreviated book title. I set the parameter num to 5, which I assumed would return the first 5 results, and stop = 1, which I interpreted to mean that no further HTTP requests would be sent once those 5 results had been returned. For some reason, with num = 5 and stop = 1, I only get 3 results, and I get the same 3 search results for different queries (they obviously should differ). In addition, while testing fixes for this I keep hitting HTTP Error 503, despite the sleep loop that others on this site have suggested should prevent it. My code is below...

    import random
    import time

    import google  # the "google" search-scraping package (needed for google.search below)

    count = 0

    my_file = open('sometextfile.txt', 'r')

    for aline in my_file:
        print("******************************")
        print(aline)
        count += 1
        record_list = aline.split("\t")  # tab-separated: index, title, ISSN list

        if "." in record_list[1]:  # only abbreviated titles contain a period
            search_results = google.search(record_list[2], num=5, stop=1, pause=3.)
            for result in search_results:
                print(result)
        time.sleep(random.randrange(0, 3))
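
As a side note on taking "the first N results": since google.search() returns a generator, a way to cap the number of results consumed that does not depend on the exact semantics of num/stop is itertools.islice. A minimal sketch, using a hypothetical stand-in generator (fake_search) instead of a live search, since the live behavior of num/stop is exactly what is in question here:

    import itertools

    def first_n(results, n):
        """Return at most the first n items from any iterable of results."""
        return list(itertools.islice(results, n))

    # Hypothetical stand-in for google.search(): yields fake result URLs.
    def fake_search(query):
        for i in range(10):
            yield "http://example.com/%s/%d" % (query, i)

    print(first_n(fake_search("1442-9985"), 5))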

which produces the following output...

    4   Environmental and Behaviour ['0143-005X']

    ******************************
    4   Sustainable Cities and Society  ['0143-005X']

    ******************************
    4   Chicago to LA: Making sense of urban theory ['0272-4944']

    ******************************
    4   As adopted by the International Health Conference   ['0272-4944']

    ******************************
    5   J. Wetl.    ['1442-9985']

    https://www.ncbi.nlm.nih.gov/nlmcatalog?term=1442-9985%5BISSN%5D
    http://www.wiley.com/bw/journal.asp?ref=1442-9985
    http://www.wiley.com/WileyCDA/WileyTitle/productCd-AEC.html
    ******************************
    5   Curr. Opin. Environ. Sustain.   ['1442-9985']

    https://www.ncbi.nlm.nih.gov/nlmcatalog?term=1442-9985%5BISSN%5D
    http://www.wiley.com/bw/journal.asp?ref=1442-9985
    http://www.wiley.com/WileyCDA/WileyTitle/productCd-AEC.html
    ******************************
    5   For. Policy Econ.   ['1442-9985']

    https://www.ncbi.nlm.nih.gov/nlmcatalog?term=1442-9985%5BISSN%5D
    http://www.wiley.com/bw/journal.asp?ref=1442-9985
    http://www.wiley.com/WileyCDA/WileyTitle/productCd-AEC.html
    ******************************
    5   For. Policy Econ.   ['1442-9985']

    https://www.ncbi.nlm.nih.gov/nlmcatalog?term=1442-9985%5BISSN%5D
    http://www.wiley.com/bw/journal.asp?ref=1442-9985
    http://www.wiley.com/WileyCDA/WileyTitle/productCd-AEC.html
    ******************************
    5   Asia. World Dev.    ['1442-9985']

    Traceback (most recent call last):
      File "C:/Users/Peter/Desktop/Programming/Ibata Arens Project/google_search.py", line 27, in <module>
        for result in search_results:
      File "C:\Users\Peter\Anaconda3\lib\site-packages\google\__init__.py", line 304, in search
        html = get_page(url)
      File "C:\Users\Peter\Anaconda3\lib\site-packages\google\__init__.py", line 121, in get_page
        response = urlopen(request)
      File "C:\Users\Peter\Anaconda3\lib\urllib\request.py", line 163, in urlopen
        return opener.open(url, data, timeout)
      File "C:\Users\Peter\Anaconda3\lib\urllib\request.py", line 472, in open
        response = meth(req, response)
      File "C:\Users\Peter\Anaconda3\lib\urllib\request.py", line 582, in http_response
        'http', request, response, code, msg, hdrs)
      File "C:\Users\Peter\Anaconda3\lib\urllib\request.py", line 504, in error
        result = self._call_chain(*args)
      File "C:\Users\Peter\Anaconda3\lib\urllib\request.py", line 444, in _call_chain
        result = func(*args)
      File "C:\Users\Peter\Anaconda3\lib\urllib\request.py", line 696, in http_error_302
        return self.parent.open(new, timeout=req.timeout)
      File "C:\Users\Peter\Anaconda3\lib\urllib\request.py", line 472, in open
        response = meth(req, response)
      File "C:\Users\Peter\Anaconda3\lib\urllib\request.py", line 582, in http_response
        'http', request, response, code, msg, hdrs)
      File "C:\Users\Peter\Anaconda3\lib\urllib\request.py", line 510, in error
        return self._call_chain(*args)
      File "C:\Users\Peter\Anaconda3\lib\urllib\request.py", line 444, in _call_chain
        result = func(*args)
      File "C:\Users\Peter\Anaconda3\lib\urllib\request.py", line 590, in http_error_default
        raise HTTPError(req.full_url, code, msg, hdrs, fp)
    urllib.error.HTTPError: HTTP Error 503: Service Unavailable
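
On the 503: Google throttles scripted queries, so a fixed random sleep is often not enough. A common workaround (a sketch, not specific to the google package) is to catch urllib.error.HTTPError and retry with exponential backoff. flaky_fetch below is a hypothetical stand-in that raises a 503 twice before succeeding:

    import time
    import urllib.error

    def with_backoff(fetch, retries=4, base_delay=1.0):
        """Call fetch(); on HTTPError, sleep and retry with a doubling delay."""
        for attempt in range(retries):
            try:
                return fetch()
            except urllib.error.HTTPError as err:
                if attempt == retries - 1:
                    raise  # out of retries; re-raise the last error
                delay = base_delay * (2 ** attempt)
                print("HTTP %d, retrying in %.2fs" % (err.code, delay))
                time.sleep(delay)

    # Hypothetical flaky fetch: raises 503 twice, then succeeds.
    calls = {"n": 0}
    def flaky_fetch():
        calls["n"] += 1
        if calls["n"] < 3:
            raise urllib.error.HTTPError("http://example.com", 503,
                                         "Service Unavailable", {}, None)
        return "ok"

    print(with_backoff(flaky_fetch, base_delay=0.01))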

I am also wondering whether it would be better to simply use urllib and walk through the returned HTML instead, since my goal is only to retrieve the ISSN for each abbreviated book title.
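
If the end goal is just the ISSN, pulling it out of fetched HTML with a regular expression is straightforward: an ISSN is four digits, a hyphen, then three digits plus a final digit or X. A sketch on a hard-coded HTML snippet (no live request; the page content is made up for illustration):

    import re

    ISSN_RE = re.compile(r"\b\d{4}-\d{3}[\dX]\b")

    html = ('<a href="http://www.wiley.com/bw/journal.asp?ref=1442-9985">'
            'Austral Ecology (ISSN 1442-9985, eISSN 1442-9993)</a>')

    # findall may return repeats; dedupe while preserving first-seen order.
    issns = list(dict.fromkeys(ISSN_RE.findall(html)))
    print(issns)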

0 Answers:

No answers yet