I have a list of about 20,000 article titles, and I want to scrape their citation counts from Google Scholar. I am new to the BeautifulSoup library. I have the following code:
import requests
from bs4 import BeautifulSoup
query = ['Role for migratory wild birds in the global spread of avian influenza H5N8',
         'Uncoupling conformational states from activity in an allosteric enzyme',
         'Technological Analysis of the World’s Earliest Shamanic Costume: A Multi-Scalar, Experimental Study of a Red Deer Headdress from the Early Holocene Site of Star Carr, North Yorkshire, UK',
         'Oxidative potential of PM 2.5 during Atlanta rush hour: Measurements of in-vehicle dithiothreitol (DTT) activity',
         'Primary Prevention of CVD',
         'Growth and Deposition of Au Nanoclusters on Polymer-wrapped Graphene and Their Oxygen Reduction Activity',
         'Relations of Preschoolers Visual-Motor and Object Manipulation Skills With Executive Function and Social Behavior',
         'We Know Who Likes Us, but Not Who Competes Against Us']
url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
content = requests.get(url).text
page = BeautifulSoup(content, 'lxml')
results = []
for entry in page.find_all("h3", attrs={"class": "gs_rt"}):
    results.append({"title": entry.a.text, "url": entry.a['href']})
but it only returns the titles and URLs. I don't know how to get the citation information from the other tag. Please help me here.
Answer 0 (score: 1)
You need to loop over the list. You can use a Session for efficiency. The version below is for bs4 4.7.1+, which supports the :contains
pseudo-class for finding the citation count. It looks like you can remove the h3 type selector from the CSS selector and just use the class before the a, i.e. .gs_rt a. If you don't have 4.7.1 you can use [title=Cite] + a to select the citation count instead (an alternative for < 4.7.1).
import requests
from bs4 import BeautifulSoup as bs
queries = ['Role for migratory wild birds in the global spread of avian influenza H5N8',
'Uncoupling conformational states from activity in an allosteric enzyme',
'Technological Analysis of the World’s Earliest Shamanic Costume: A Multi-Scalar, Experimental Study of a Red Deer Headdress from the Early Holocene Site of Star Carr, North Yorkshire, UK',
'Oxidative potential of PM 2.5 during Atlanta rush hour: Measurements of in-vehicle dithiothreitol (DTT) activity',
'Primary Prevention of CVD','Growth and Deposition of Au Nanoclusters on Polymer-wrapped Graphene and Their Oxygen Reduction Activity',
'Relations of Preschoolers Visual-Motor and Object Manipulation Skills With Executive Function and Social Behavior',
'We Know Who Likes Us, but Not Who Competes Against Us']
with requests.Session() as s:
    for query in queries:
        url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
        r = s.get(url)
        soup = bs(r.content, 'lxml')  # or 'html.parser'
        title = soup.select_one('h3.gs_rt a').text if soup.select_one('h3.gs_rt a') is not None else 'No title'
        link = soup.select_one('h3.gs_rt a')['href'] if title != 'No title' else 'No link'
        citations = soup.select_one('a:contains("Cited by")').text if soup.select_one('a:contains("Cited by")') is not None else 'No citation count'
        print(title, link, citations)
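If you are on a BeautifulSoup version older than 4.7.1 and cannot use :contains, only the citations line needs to change. A minimal sketch of that variant, assuming the Cite link on the results page carries the title=Cite attribute mentioned above:

cited_by = soup.select_one('[title=Cite] + a')  # the anchor immediately after the Cite control
citations = cited_by.text if cited_by is not None else 'No citation count'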
The bottom version was re-written thanks to the comments from @facelessuser; the top version is left in for comparison:
It would probably be more efficient to not call select_one twice in a single-line if statement. When the pattern is built it is cached, but the returned tag is not cached. Personally I would set a variable to whatever select_one returns and then, only if that variable is None, change it to 'No link' or 'No title' etc. It is less compact, but it will be more efficient.
[...] always check whether a tag is None with if tag is not None:, not just if tag:. With selectors it doesn't matter much, since they only return tags, but if you did something like for x in tag.descendants: you would get text nodes (strings) as well as tags, and even an empty string, which evaluates as false, is a valid node. The safest thing in that case is to check for None.
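Following those comments, the rewritten loop might look like the sketch below (same queries list as above; each select_one result is stored once and reused, so this is an illustration of the suggestion rather than the exact original rewrite):

with requests.Session() as s:
    for query in queries:
        url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
        soup = bs(s.get(url).content, 'lxml')
        title_tag = soup.select_one('h3.gs_rt a')  # select once, keep the tag
        if title_tag is None:
            title, link = 'No title', 'No link'
        else:
            title, link = title_tag.text, title_tag['href']
        cited_by = soup.select_one('a:contains("Cited by")')  # select once here as well
        citations = cited_by.text if cited_by is not None else 'No citation count'
        print(title, link, citations)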
Answer 1 (score: 1)
Rather than finding all the <h3> tags, I suggest you search for the tags that contain both the <h3> and the citation (which sits inside <div class="gs_rs">), i.e. find all the <div class="gs_ri"> tags.
Then, from those tags, you should be able to get everything you need:
import requests
from bs4 import BeautifulSoup

query = ['Role for migratory wild birds in the global spread of avian influenza H5N8',
         'Uncoupling conformational states from activity in an allosteric enzyme',
         'Technological Analysis of the World’s Earliest Shamanic Costume: A Multi-Scalar, Experimental Study of a Red Deer Headdress from the Early Holocene Site of Star Carr, North Yorkshire, UK',
         'Oxidative potential of PM 2.5 during Atlanta rush hour: Measurements of in-vehicle dithiothreitol (DTT) activity',
         'Primary Prevention of CVD',
         'Growth and Deposition of Au Nanoclusters on Polymer-wrapped Graphene and Their Oxygen Reduction Activity',
         'Relations of Preschoolers Visual-Motor and Object Manipulation Skills With Executive Function and Social Behavior',
         'We Know Who Likes Us, but Not Who Competes Against Us']
url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
content = requests.get(url).text
page = BeautifulSoup(content, 'lxml')
results = []
for entry in page.find_all("div", attrs={"class": "gs_ri"}):  # tag containing both the h3 and the citation
    results.append({"title": entry.h3.a.text, "url": entry.a['href'], "citation": entry.find("div", attrs={"class": "gs_rs"}).text})
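Note that the citation value scraped above is display text (in Answer 0, something like 'Cited by 123'). If you need the bare number for a 20,000-title run, a small helper can pull the first integer out; this is a sketch added for illustration, not part of either answer:

import re

def citation_count(text):
    # Extract the first integer from strings like 'Cited by 123'; None if absent.
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None

print(citation_count('Cited by 123'))  # 123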