Searching the first links of many Google searches in Python

Date: 2016-01-19 14:03:04

Tags: python json web-scraping urllib scrape

I want to scrape the first link of each of 23,000 Google searches and append them to the dataframe I'm working with. This is the error I get:

Traceback (most recent call last):
File "file.py", line 26, in <module>
website = showsome(company)
File "file.py", line 18, in showsome
hits = data['results']
TypeError: 'NoneType' object has no attribute '__getitem__'

Here is the code I have so far:

import json
import urllib
import pandas as pd

def showsome(searchfor):
    query = urllib.urlencode({'q': searchfor})
    url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' % query
    search_response = urllib.urlopen(url)
    search_results = search_response.read()
    results = json.loads(search_results)
    data = results['responseData']
    hits = data['results']  # file.py line 18: where the TypeError is raised
    d = hits[0]['visibleUrl']
    return d

company_names = pd.read_csv("my_file.csv")

websites = []
for company in company_names["Company"]:
    website = showsome(company)
    websites.append(website)
websites = pd.DataFrame(websites, columns=["Website"])

result = pd.concat([company_names,websites], axis=1, join='inner')
result.to_csv("export_file.csv", index=False, encoding="utf-8")

(I changed the names of the input and output files for privacy reasons.)

Thanks!

1 Answer:

Answer 0 (score: 1)

I will try to explain why this exception is raised.

I can see that Google has detected you and is returning a nicely formatted refusal instead of search results, i.e.:

{u'responseData': None, u'responseDetails': u'Suspected Terms of Service Abuse. Please see http://code.google.com/apis/errors', u'responseStatus': 403}

json.loads(search_results) parses that refusal and assigns it to results. Then

data = results['responseData']

sets data to None, so when hits = data['results'] runs, it raises the TypeError above, because None has no __getitem__ attribute.
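
As a minimal sketch (not from the original answer; the showsome_safe name is hypothetical), one way to fail gracefully when responseData is None instead of crashing:

import json
import urllib

def showsome_safe(searchfor):
    query = urllib.urlencode({'q': searchfor})
    url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' % query
    results = json.loads(urllib.urlopen(url).read())
    data = results.get('responseData')
    if data is None:
        # Google refused the request (e.g. responseStatus 403);
        # report it and skip instead of raising a TypeError.
        print 'Blocked: %s' % results.get('responseDetails')
        return None
    hits = data.get('results') or []
    return hits[0]['visibleUrl'] if hits else None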

I tried to mimic a real user by waiting between requests with time.sleep(random.choice((1,3,3,2,4,1,0))) from the time and random modules (just a simple attempt). But I strongly discourage doing this if you do not have Google's permission. BTW, I used the code below:

import json, random, time
import urllib
import pandas as pd

def showsome(searchfor):
    query = urllib.urlencode({'q': searchfor})
    url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' % query
    search_response = urllib.urlopen(url)
    search_results = search_response.read()
    results = json.loads(search_results)
    data = results['responseData']
    hits = data['results']
    d = hits[0]['visibleUrl']
    return d

company_names = pd.read_csv("my_file.csv")

websites = []
for company in company_names["Company"]:
    website = showsome(company)
    websites.append(website)
    time.sleep(random.choice((1,3,3,2,4,1,0)))  # random pause between requests
    print website

websites = pd.DataFrame(websites, columns=["Website"])

result = pd.concat([company_names,websites], axis=1, join='inner')
result.to_csv("export_file.csv", index=False, encoding="utf-8")

It generated a csv that contains:

Company,Website
American Axle,www.aam.com
American Broadcasting Company,en.wikipedia.org
American Eagle Outfitters,ae.com
American Electric Power,www.aep.com
American Express,www.americanexpress.com
American Family Insurance,www.amfam.com
American Financial Group,www.afginc.com
American Greetings,www.americangreetings.com
