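Extracting Google Scholar results using Python (or R)

Asked: 2012-11-02 18:02:30

Tags: python r google-scholar

I'd like to use Python to scrape Google Scholar search results. I found two different scripts for this: one is gscholar.py and the other is scholar.py (can that one be used as a Python library?).

I should say up front that I'm totally new to Python, so sorry if I'm missing the obvious!

The problem is that when I use gscholar.py as explained in the README file, I get as a result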

query() takes at least 2 arguments (1 given)

Even when I specify an additional argument (e.g. gscholar.query("my query", allresults=True)), I get

query() takes at least 2 arguments (2 given)

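This confuses me. I also tried specifying the third possible argument (outformat=4, which is the BibTeX format), but that gives me a list of function errors. A colleague advised me to import BeautifulSoup and this before running the query, but that doesn't change the problem either. Any suggestions on how to solve it?

I found code for R (see link) as a solution, but it quickly got blocked by Google. Maybe someone could suggest how to improve that code to avoid being blocked? Any help would be appreciated! Thanks!

7 answers:

Answer 0 (score: 13)

I suggest you not use a library built for crawling one specific website, but a general-purpose HTML library that is well tested and well documented, such as BeautifulSoup.

To access a website while presenting browser information, you can use a URL opener class with a custom user agent: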

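# Python 2 import; in Python 3 the class moved to urllib.request (deprecated since 3.3)
from urllib import FancyURLopener

class MyOpener(FancyURLopener):
    # This class attribute is sent as the User-Agent header
    version = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36'

openurl = MyOpener().open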

Then download the required URL as follows:

openurl(url).read()

To retrieve Scholar results, just use the URL http://scholar.google.se/scholar?hl=en&q=${query}.
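For readers on Python 3, where the opener class above is deprecated, the same idea can be sketched with urllib.request.Request. This is only a sketch, not the answer's own code: the names SCHOLAR_URL, USER_AGENT, build_request and fetch are made up for illustration, and fetch performs a real network request, so it is defined here but not called.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Illustrative constants; the URL and user-agent string are the ones quoted in this answer.
SCHOLAR_URL = "http://scholar.google.se/scholar"
USER_AGENT = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36")


def build_request(query):
    """Return a Request for a Scholar search carrying a browser-like User-Agent."""
    # urlencode escapes the query text (spaces become '+', and so on)
    url = SCHOLAR_URL + "?" + urlencode({"hl": "en", "q": query})
    return Request(url, headers={"User-Agent": USER_AGENT})


def fetch(query, timeout=30):
    """Download the result page as bytes (this performs a real network request)."""
    with urlopen(build_request(query), timeout=timeout) as response:
        return response.read()


req = build_request("coffee chemistry")
print(req.full_url)
```

Building the request separately from sending it makes it easy to inspect the URL and headers before anything goes over the wire.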

To extract pieces of information from the retrieved HTML file, you could use this piece of code:

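from bs4 import SoupStrainer, BeautifulSoup
# Parse only the <div id="gs_ab_md"> element; naming the parser avoids bs4's "no parser specified" warning
page = BeautifulSoup(openurl(url).read(), 'html.parser', parse_only=SoupStrainer('div', id='gs_ab_md'))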

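This piece of code extracts one specific div element, the one that contains the number of results shown on a Google Scholar search results page.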
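Once you have the text of that div, you will probably want the count as a number. Here is a rough sketch of that step. It assumes the text looks something like "About 1,230,000 results (0.05 sec)", which is my assumption about the page and may not match what Google actually serves; parse_result_count is a made-up helper name.

```python
import re


def parse_result_count(text):
    """Return the first number in text as an int, or None if there is none."""
    # Take the first run of digits, allowing ',' or '.' as thousands separators
    match = re.search(r"\d[\d,.]*", text)
    if match is None:
        return None
    # Drop the separators and keep only the digits
    return int(re.sub(r"\D", "", match.group(0)))


print(parse_result_count("About 1,230,000 results (0.05 sec)"))
```

With bs4, the input string would come from something like page.get_text() on the object built above.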

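Answer 1 (score: 5)

Google will block you... as it will be apparent that you aren't a browser. Namely, they will detect the same request signature occurring too frequently for human activity....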

You can do:
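To illustrate the frequency point, one option is to space requests out with a small throttling helper. This is a sketch of my own rather than something taken from this answer: the Throttle class, its default delays, and the injectable clock and sleep arguments are all assumptions chosen for illustration, and slowing down does not guarantee that you won't be blocked.

```python
import random
import time


class Throttle:
    """Enforce a minimum, slightly randomized gap between successive calls."""

    def __init__(self, min_delay=10.0, jitter=5.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay  # seconds that must pass between calls
        self.jitter = jitter        # extra random delay, 0..jitter seconds
        self._clock = clock
        self._sleep = sleep
        self._last = None           # time of the previous call, if any

    def wait(self):
        """Sleep if the previous call was too recent; return seconds slept."""
        slept = 0.0
        if self._last is not None:
            target = self.min_delay + random.uniform(0.0, self.jitter)
            remaining = target - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept


throttle = Throttle(min_delay=0.2, jitter=0.1)
for query in ["first query", "second query"]:
    throttle.wait()  # call this before each request
    print("would request:", query)
```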

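Answer 2 (score: 3)

It looks like scraping with Python and R runs into the problem that Google Scholar treats your request as a robot query because the request lacks a user agent. There is a similar question on StackExchange about downloading all pdfs linked from a web page, and the answer there points the user to wget on Unix and the BeautifulSoup package in Python.

Curl also seems to be a more promising direction.

Answer 3 (score: 2)

COPython looks correct, but here's a bit of an explanation by example...

Consider f: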

def f(a,b,c=1):
    pass

f expects values for a and b no matter what. You can leave c blank, since it has a default.

f(1,2)     #executes fine
f(a=1,b=2) #executes fine
f(1,c=1)   #TypeError: f() takes at least 2 arguments (2 given)

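The fact that you are getting blocked by Google is probably due to the user-agent setting in your headers... I'm not familiar with R, but I can give you the general algorithm for fixing this:

  1. Use a normal browser (Firefox or whatever) to access the URL while monitoring HTTP traffic (I like Wireshark)
  2. Take note of all the headers sent in the relevant HTTP request
  3. Try running your script and note its headers as well
  4. Spot the difference
  5. Set your R script to use the headers you saw when examining the browser traffic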
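Step 4 is easy to do by eye, but it can also be sketched in code once both sets of headers have been copied into dictionaries. The helper name diff_headers and the sample header values below are made up for illustration and are not real captures.

```python
def diff_headers(browser, script):
    """Compare two header dicts, matching header names case-insensitively.

    Returns (missing, different): headers the script does not send at all,
    and headers it sends with a different value than the browser.
    """
    b = {name.lower(): value for name, value in browser.items()}
    s = {name.lower(): value for name, value in script.items()}
    missing = {name: b[name] for name in b if name not in s}
    different = {name: (b[name], s[name])
                 for name in b if name in s and b[name] != s[name]}
    return missing, different


# Hypothetical captures, for illustration only
browser_headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "Accept-Language": "en-US,en;q=0.5",
}
script_headers = {"user-agent": "Python-urllib/3.11"}

missing, different = diff_headers(browser_headers, script_headers)
print("missing from script:", missing)
print("different values:", different)
```

Header names are lowercased before comparing because HTTP header names are case-insensitive.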

Answer 4 (score: 1)

Here is the call signature of query()...
def query(searchstr, outformat, allresults=False)

So you need to specify at least a searchstr and an outformat; allresults is an optional flag/argument.
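To see why the calls in the question fail, here is a small stand-in with the same signature. It is not the real gscholar.query: it only echoes its arguments, and the value 4 for BibTeX is taken from the question. The exact wording of the TypeError also differs between Python versions.

```python
def query(searchstr, outformat, allresults=False):
    """Stand-in with the signature quoted above; it only echoes its arguments."""
    return searchstr, outformat, allresults


# Fails: outformat has no default, so it must always be supplied
try:
    query("my query", allresults=True)
except TypeError as exc:
    print("failed:", exc)

# Works: outformat given positionally or by keyword
print(query("my query", 4))
print(query("my query", outformat=4, allresults=True))
```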

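Answer 5 (score: 0)

You may want to use Greasemonkey for this task. The advantage is that Google will fail to detect you as a bot if you also keep the request frequency down. You can also watch the script working in your browser window.

You can learn to code it yourself or use a script from one of these sources.

Answer 6 (score: -1)

Disclosure: I work at SerpApi.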


You can use the google-search-results package to extract data from Google Scholar. Check a demo at Repl.it.

from serpapi.google_search_results import GoogleSearchResults

params = {
  "engine": "google_scholar",
  "q": "coffee",
}

client = GoogleSearchResults(params)
data = client.get_dict()

print("Organic results\n")

for result in data['organic_results']:
  print(f"""Title: {result['title']}
Result ID: {result['result_id']}
Link: {result['link']}
""")

Response

{
  "organic_results": [
    {
      "position": 0,
      "title": "Phenolic compounds in coffee",
      "result_id": "re9ssrU-exUJ",
      "type": "Html",
      "link": "http://www.scielo.br/scielo.php?pid=S1677-04202006000100003&script=sci_arttext",
      "snippet": "Phenolic compounds are secondary metabolites generally involved in plant adaptation to environmental stress conditions. Chlorogenic acids (CGA) and related compounds are the main components of the phenolic fraction of green coffee beans, reaching levels up to …",
      "publication_info": {
        "summary": "A Farah, CM Donangelo - Brazilian journal of plant physiology, 2006 - SciELO Brasil"
      },
      "resources": [
        {
          "title": "scielo.br",
          "file_format": "HTML",
          "link": "http://www.scielo.br/scielo.php?pid=S1677-04202006000100003&script=sci_arttext"
        }
      ],
      "inline_links": {
        "serpapi_cite_link": "https://serpapi.com/search.json?engine=google_scholar_cite&q=re9ssrU-exUJ",
        "html_version": "https://scholar.google.comhttp://www.scielo.br/scielo.php?pid=S1677-04202006000100003&script=sci_arttext",
        "cited_by": {
          "total": 608,
          "link": "https://scholar.google.com/scholar?cites=1547899847035383725&as_sdt=5,44&sciodt=0,44&hl=en",
          "serpapi_scholar_link": "https://serpapi.com/search.json?cites=1547899847035383725&engine=google_scholar&hl=en&q=Coffee"
        },
        "related_pages_link": "https://scholar.google.com/scholar?q=related:re9ssrU-exUJ:scholar.google.com/&scioq=Coffee&hl=en&as_sdt=0,44",
        "versions": {
          "total": 6,
          "link": "https://scholar.google.com/scholar?cluster=1547899847035383725&hl=en&as_sdt=0,44",
          "serpapi_scholar_link": "https://serpapi.com/search.json?cluster=1547899847035383725&engine=google_scholar&hl=en&q=Coffee"
        },
        "cached_page_link": "https://scholar.google.comhttp://scholar.googleusercontent.com/scholar?q=cache:re9ssrU-exUJ:scholar.google.com/+Coffee&hl=en&as_sdt=0,44"
      }
    },
    {
      "position": 1,
      "title": "Functional properties of coffee and coffee by-products",
      "result_id": "9WouRiFbIK4J",
      "link": "https://www.sciencedirect.com/science/article/pii/S0963996911003449",
      "snippet": "Coffee, one of the most popular beverages, is consumed by millions of people every day. Traditionally, coffee beneficial effects have been attributed solely to its most intriguing and investigated ingredient, caffeine, but it is now known that other compounds also contribute to …",
      "publication_info": {
        "summary": "P Esquivel, VM Jiménez - Food Research International, 2012 - Elsevier",
        "authors": [
          {
            "name": "P Esquivel",
            "link": "https://scholar.google.com/citations?user=EpwJXskAAAAJ&hl=en&oi=sra"
          },
          {
            "name": "VM Jiménez",
            "link": "https://scholar.google.com/citations?user=_P0h0B8AAAAJ&hl=en&oi=sra"
          }
        ]
      },
      "resources": [
        {
          "title": "uoregon.edu",
          "file_format": "PDF",
          "link": "https://pages.uoregon.edu/chendon/coffee_literature/2012%20Food%20Res.%20Int.,%20Uses%20for%20coffee%20waste.pdf"
        }
      ],
      "inline_links": {
        "serpapi_cite_link": "https://serpapi.com/search.json?engine=google_scholar_cite&q=9WouRiFbIK4J",
        "cited_by": {
          "total": 531,
          "link": "https://scholar.google.com/scholar?cites=12547128760323697397&as_sdt=5,44&sciodt=0,44&hl=en",
          "serpapi_scholar_link": "https://serpapi.com/search.json?cites=12547128760323697397&engine=google_scholar&hl=en&q=Coffee"
        },
        "related_pages_link": "https://scholar.google.com/scholar?q=related:9WouRiFbIK4J:scholar.google.com/&scioq=Coffee&hl=en&as_sdt=0,44",
        "versions": {
          "total": 9,
          "link": "https://scholar.google.com/scholar?cluster=12547128760323697397&hl=en&as_sdt=0,44",
          "serpapi_scholar_link": "https://serpapi.com/search.json?cluster=12547128760323697397&engine=google_scholar&hl=en&q=Coffee"
        }
      }
    },
    {
      "position": 2,
      "title": "Coffee constituents",
      "result_id": "xY3q9qnkN54J",
      "link": "https://books.google.com/books?hl=en&lr=&id=y0qA89vCr3MC&oi=fnd&pg=PT47&dq=Coffee&ots=pyKSUohpI7&sig=8qULQFDS2RydGAkXlRyVJoph4AU",
      "snippet": "Coffee has been for decades the most commercialized food product and most widely consumed beverage in the world. Since the opening of the first coffee house in Mecca at the end of the fifteenth century, coffee consumption has greatly increased all around the world …",
      "publication_info": {
        "summary": "A Farah - Coffee: Emerging health effects and disease …, 2012 - books.google.com"
      },
      "resources": [
        {
          "title": "academia.edu",
          "file_format": "PDF",
          "link": "http://www.academia.edu/download/52419982/IFTPressBook_Coffee_PreviewChapter.pdf"
        }
      ],
      "inline_links": {
        "serpapi_cite_link": "https://serpapi.com/search.json?engine=google_scholar_cite&q=xY3q9qnkN54J",
        "cited_by": {
          "total": 255,
          "link": "https://scholar.google.com/scholar?cites=11400832400354872773&as_sdt=5,44&sciodt=0,44&hl=en",
          "serpapi_scholar_link": "https://serpapi.com/search.json?cites=11400832400354872773&engine=google_scholar&hl=en&q=Coffee"
        },
        "related_pages_link": "https://scholar.google.com/scholar?q=related:xY3q9qnkN54J:scholar.google.com/&scioq=Coffee&hl=en&as_sdt=0,44",
        "versions": {
          "total": 7,
          "link": "https://scholar.google.com/scholar?cluster=11400832400354872773&hl=en&as_sdt=0,44",
          "serpapi_scholar_link": "https://serpapi.com/search.json?cluster=11400832400354872773&engine=google_scholar&hl=en&q=Coffee"
        }
      }
    }
  ]
}

Output

Organic results

Title: Phenolic compounds in coffee
Result ID: re9ssrU-exUJ
Link: http://www.scielo.br/scielo.php?pid=S1677-04202006000100003&script=sci_arttext

Title: Functional properties of coffee and coffee by-products
Result ID: 9WouRiFbIK4J
Link: https://www.sciencedirect.com/science/article/pii/S0963996911003449

Title: Coffee constituents
Result ID: xY3q9qnkN54J
Link: https://books.google.com/books?hl=en&lr=&id=y0qA89vCr3MC&oi=fnd&pg=PT47&dq=coffee&ots=pyKSUokkMc&sig=sjDv_w50O-5_svJDJKPJ7hHJtRg

Title: All about coffee
Result ID: fGeQlvu-2_IJ
Link: https://books.google.com/books?hl=en&lr=&id=oJxpQX4ko7cC&oi=fnd&pg=PT1&dq=coffee&ots=Oih_E-45Y-&sig=KYyBOoSXwRdwOv5upyWwl0FzIq8

Title: Biotechnological potential of coffee pulp and coffee husk for bioprocesses
Result ID: Zu7aKNjvAUwJ
Link: https://www.sciencedirect.com/science/article/pii/S1369703X0000084X

Title: Biodiversity conservation in traditional coffee systems of Mexico
Result ID: pIjQPO7__AYJ
Link: https://conbio.onlinelibrary.wiley.com/doi/abs/10.1046/j.1523-1739.1999.97153.x

Title: Coffee flavor chemistry
Result ID: UwtLySK5iawJ
Link: https://books.google.com/books?hl=en&lr=&id=NQi1LYJxFvUC&oi=fnd&pg=PP13&dq=coffee&ots=dRSace3WYu&sig=5jyqtvqkL_jGDkWTLsLqksKiQUw

Title: Coffee and health: a review of recent human research
Result ID: fSVlrXX7dIUJ
Link: https://www.tandfonline.com/doi/abs/10.1080/10408390500400009

Title: M-Coffee: combining multiple sequence alignment methods with T-Coffee
Result ID: _3o-xhuGyg0J
Link: https://academic.oup.com/nar/article-abstract/34/6/1692/2401531

Title: Producing decaffeinated coffee plants
Result ID: VJySkcFsQ1EJ
Link: https://www.nature.com/articles/423823a
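Nested fields such as the citation count are not present on every result, so it may be worth reading them defensively. The sketch below works on a plain dict shaped like the response shown above and does not call the API. The helper name citation_summary is made up, and the third sample entry is hypothetical, added only to show the missing-field case.

```python
def citation_summary(data):
    """Return (title, cited_by_total) pairs, most cited first."""
    rows = []
    for result in data.get("organic_results", []):
        # inline_links or cited_by may be absent, so fall back to empty dicts
        cited_by = result.get("inline_links", {}).get("cited_by", {})
        rows.append((result.get("title"), cited_by.get("total", 0)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Trimmed sample; the first two entries reuse values from the response above
sample = {
    "organic_results": [
        {"title": "Coffee constituents",
         "inline_links": {"cited_by": {"total": 255}}},
        {"title": "Phenolic compounds in coffee",
         "inline_links": {"cited_by": {"total": 608}}},
        {"title": "Hypothetical entry without citation data"},
    ]
}

for title, total in citation_summary(sample):
    print(f"{total:>5}  {title}")
```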

If you want more information, check out the SerpApi documentation or the live playground.
