Scraping and parsing citation information from Google Scholar search results

Date: 2019-05-20 11:47:04

Tags: python web-scraping beautifulsoup google-scholar

I have a list of around 20,000 article titles, and I want to scrape their citation counts from Google Scholar. I am new to the BeautifulSoup library. I have this code:

import requests
from bs4 import BeautifulSoup

query = ['Role for migratory wild birds in the global spread of avian influenza H5N8',
         'Uncoupling conformational states from activity in an allosteric enzyme',
         'Technological Analysis of the World’s Earliest Shamanic Costume: A Multi-Scalar, Experimental Study of a Red Deer Headdress from the Early Holocene Site of Star Carr, North Yorkshire, UK',
         'Oxidative potential of PM 2.5 during Atlanta rush hour: Measurements of in-vehicle dithiothreitol (DTT) activity',
         'Primary Prevention of CVD',
         'Growth and Deposition of Au Nanoclusters on Polymer-wrapped Graphene and Their Oxygen Reduction Activity',
         'Relations of Preschoolers Visual-Motor and Object Manipulation Skills With Executive Function and Social Behavior',
         'We Know Who Likes Us, but Not Who Competes Against Us']

url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'

content = requests.get(url).text
page = BeautifulSoup(content, 'lxml')
results = []
for entry in page.find_all("h3", attrs={"class": "gs_rt"}):
    results.append({"title": entry.a.text, "url": entry.a['href']})

but it returns only the title and the URL. I don't know how to get the citation information from another tag. Please help me out here.

2 answers:

Answer 0: (score: 1)

You need to loop over the list. You can use a Session for efficiency. The code below is for bs4 4.7.1, which supports the :contains pseudo-class for finding the citation count. It looks like you could drop the :contains-type selector from the css and simply use the class before the a, i.e. .gs_rt a. If you don't have 4.7.1, you can use the selector below to pick out the citation count instead.

[title=Cite] + a

Alternative selector for bs4 < 4.7.1.

import requests
from bs4 import BeautifulSoup as bs

queries = ['Role for migratory wild birds in the global spread of avian influenza H5N8',
         'Uncoupling conformational states from activity in an allosteric enzyme',
         'Technological Analysis of the World’s Earliest Shamanic Costume: A Multi-Scalar, Experimental Study of a Red Deer Headdress from the Early Holocene Site of Star Carr, North Yorkshire, UK',
         'Oxidative potential of PM 2.5  during Atlanta rush hour: Measurements of in-vehicle dithiothreitol (DTT) activity',
         'Primary Prevention of CVD','Growth and Deposition of Au Nanoclusters on Polymer-wrapped Graphene and Their Oxygen Reduction Activity',
         'Relations of Preschoolers Visual-Motor and Object Manipulation Skills With Executive Function and Social Behavior',
         'We Know Who Likes Us, but Not Who Competes Against Us']

with requests.Session() as s:
    for query in queries:
        url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
        r = s.get(url)
        soup = bs(r.content, 'lxml') # or 'html.parser'
        title = soup.select_one('h3.gs_rt a').text if soup.select_one('h3.gs_rt a') is not None else 'No title'
        link = soup.select_one('h3.gs_rt a')['href'] if title != 'No title' else 'No link'
        citations = soup.select_one('a:contains("Cited by")').text if soup.select_one('a:contains("Cited by")') is not None else 'No citation count'
        print(title, link, citations) 

The bottom version was rewritten thanks to comments from @facelessuser. The top version is left in for comparison:

It probably would be more efficient not to call select_one twice in a single-line if statement. While pattern building is cached, the returned tag is not. Personally, I would set a variable to whatever select_one returns and then, only if that variable is None, change it to 'No link' or 'No title' etc. It isn't as compact, but it would be more efficient.

[...] always check if tags are None:, and not just if tag:. With selectors it isn't a big deal, as they will only return tags, but if you ever do something like for x in tag.descendants, you get text nodes (strings) as well as tags, and even an empty string will evaluate to false despite being a valid node. Safest in that case is to check for None.
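Taken together, the two comments suggest a refactor along these lines: call each selector once, keep the returned tag, and only substitute the fallback string when the tag is None. This is a sketch of that idea, not the answerer's actual rewrite (parse_first_result is a hypothetical helper name):

```python
from bs4 import BeautifulSoup

def parse_first_result(html):
    """Parse one Scholar results page, querying each selector only once."""
    soup = BeautifulSoup(html, 'html.parser')
    title_tag = soup.select_one('h3.gs_rt a')        # called once, tag cached
    cite_tag = soup.select_one('a[title=Cite] + a')  # selector for bs4 < 4.7.1
    # Fall back only when the cached tag is None, as the comment advises
    title = title_tag.text if title_tag is not None else 'No title'
    link = title_tag['href'] if title_tag is not None else 'No link'
    citations = cite_tag.text if cite_tag is not None else 'No citation count'
    return title, link, citations
```

The explicit `is not None` checks also follow the second comment: a tag that happens to be empty would still be a valid node, so truthiness tests are the wrong tool here.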

Answer 1: (score: 1)

Instead of finding all the <h3> tags, I suggest you search for the tags that enclose both the <h3> and the citation (which sits inside <div class="gs_rs">), i.e. find all the <div class="gs_ri"> tags.

From those tags, you should be able to get everything you need:

import requests
from bs4 import BeautifulSoup

query = ['Role for migratory wild birds in the global spread of avian influenza H5N8',
         'Uncoupling conformational states from activity in an allosteric enzyme',
         'Technological Analysis of the World’s Earliest Shamanic Costume: A Multi-Scalar, Experimental Study of a Red Deer Headdress from the Early Holocene Site of Star Carr, North Yorkshire, UK',
         'Oxidative potential of PM 2.5 during Atlanta rush hour: Measurements of in-vehicle dithiothreitol (DTT) activity',
         'Primary Prevention of CVD',
         'Growth and Deposition of Au Nanoclusters on Polymer-wrapped Graphene and Their Oxygen Reduction Activity',
         'Relations of Preschoolers Visual-Motor and Object Manipulation Skills With Executive Function and Social Behavior',
         'We Know Who Likes Us, but Not Who Competes Against Us']

results = []
for q in query:
    url = 'https://scholar.google.com/scholar?q=' + q + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
    content = requests.get(url).text
    page = BeautifulSoup(content, 'lxml')
    for entry in page.find_all("div", attrs={"class": "gs_ri"}): # tag containing both h3 and citation
        results.append({"title": entry.h3.a.text, "url": entry.a['href'],
                        "citation": entry.find("div", attrs={"class": "gs_rs"}).text})
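If what you ultimately want for the 20,000-title list is the numeric count, the "Cited by N" link text still has to be parsed. A small helper along these lines would do it (the function name and regex are my own illustration, not part of either answer):

```python
import re

def citation_count(link_text):
    """Pull the integer out of a 'Cited by N' string; None when absent."""
    match = re.search(r'Cited by (\d+)', link_text)
    return int(match.group(1)) if match else None

# e.g. citation_count('Cited by 1234') -> 1234
```

Returning None for non-matching text (e.g. a "Related articles" link) keeps the missing-value case distinct from a genuine count of zero.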