I am trying to scrape this page and get the URL of each article title. Each article's title is an "h3" element; for example, the first result is a link with the text "Functional annotation of a full-length mouse cDNA collection" which links to this page.
All my search returns is "[]".
My code is as follows:
import requests
from bs4 import BeautifulSoup
req = requests.get('https://www.lens.org/lens/scholar/search/results?q="edith%20cowan"')
soup = BeautifulSoup(req.content, "html5lib")
article_links = soup.select('h3 a')
print(article_links)
Where am I going wrong?
Answer 0: (score: 1)
You are having this problem because you are using the wrong link to get the article links. So I made some changes and came up with the following code (note that I removed the bs4 module since it is no longer needed):
import requests
search = "edith cowan"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
json = {"scholarly_search":{"from":0,"size":"10","_source":{"excludes":["referenced_by_patent_hash","referenced_by_patent","reference"]},"query":{"bool":{"must":[{"query_string":{"query":f"\"{search}\"","fields":["title","abstract","default"],"default_operator":"and"}}],"must_not":[{"terms":{"publication_type":["unknown"]}}],"filter":[]}},"highlight":{"pre_tags":["<span class=\"highlight\">"],"post_tags":["</span>"],"fields":{"title":{}},"number_of_fragments":0},"sort":[{"referenced_by_patent_count":{"order":"desc"}}]},"view":"scholar"}
req = requests.post("https://www.lens.org/lens/api/multi/search?request_cache=true", headers = headers, json = json).json()
links = []
for x in req["query_result"]["hits"]["hits"]:
    links.append("https://www.lens.org/lens/scholar/article/{}/main".format(x["_source"]["record_lens_id"]))
The search variable is set to the term you are searching for (in your case "edith cowan"). The links are stored in the links variable.
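As a quick offline check of how each link is built, here is a small helper (the function name and the sample hit are my own, not part of the answer) that turns one hit from the JSON response into an article URL:

```python
def article_url(hit):
    """Build the article URL from one search hit, using its Lens record id."""
    return "https://www.lens.org/lens/scholar/article/{}/main".format(
        hit["_source"]["record_lens_id"])

# A minimal stand-in for one entry of req["query_result"]["hits"]["hits"]:
sample_hit = {"_source": {"record_lens_id": "000-000-000-000-000"}}
print(article_url(sample_hit))
# https://www.lens.org/lens/scholar/article/000-000-000-000-000/main
```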
Edit: how I did it
So the main question is probably where I got the link from and how I knew what to put in the json variable. For this I used a simple HTML interceptor (in my case Burp Suite Community Edition).
This tool showed me that when you visit this URL (the one you send your request to in your question), the browser sends a request to https://www.lens.org/lens/api/multi/search?request_cache=true and retrieves all the information from there. Burp Suite also shows you what data packets were sent, so I copy-pasted them into the json variable.
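Rather than copy-pasting the intercepted packet verbatim, the payload can also be rebuilt from its parameters. The build_payload helper below is my own refactor (not part of the original answer), assuming the same field layout as the intercepted request:

```python
def build_payload(search, size=10, start=0):
    """Rebuild the intercepted search payload from its three variable parts."""
    return {
        "scholarly_search": {
            "from": start,
            "size": str(size),
            "_source": {"excludes": ["referenced_by_patent_hash",
                                     "referenced_by_patent", "reference"]},
            "query": {"bool": {
                "must": [{"query_string": {
                    "query": f"\"{search}\"",
                    "fields": ["title", "abstract", "default"],
                    "default_operator": "and"}}],
                "must_not": [{"terms": {"publication_type": ["unknown"]}}],
                "filter": []}},
            "highlight": {"pre_tags": ["<span class=\"highlight\">"],
                          "post_tags": ["</span>"],
                          "fields": {"title": {}},
                          "number_of_fragments": 0},
            "sort": [{"referenced_by_patent_count": {"order": "desc"}}]},
        "view": "scholar"}

payload = build_payload("edith cowan")
print(payload["scholarly_search"]["query"]["bool"]["must"][0]["query_string"]["query"])
# "edith cowan"
```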
For a better visualization, here is how it looks in Burp Suite:
Edit: scanning all pages
To scan all the pages, you can use the following script:
import requests
search = "edith cowan" #Change this to the term you are searching for
r_to_show = 100 #This is the number of articles per page (I strongly recommend leaving it at 100)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
json = {"scholarly_search":{"from":0,"size":f"{r_to_show}","_source":{"excludes":["referenced_by_patent_hash","referenced_by_patent","reference"]},"query":{"bool":{"must":[{"query_string":{"query":f"\"{search}\"","fields":["title","abstract","default"],"default_operator":"and"}}],"must_not":[{"terms":{"publication_type":["unknown"]}}],"filter":[]}},"highlight":{"pre_tags":["<span class=\"highlight\">"],"post_tags":["</span>"],"fields":{"title":{}},"number_of_fragments":0},"sort":[{"referenced_by_patent_count":{"order":"desc"}}]},"view":"scholar"}
req = requests.post("https://www.lens.org/lens/api/multi/search?request_cache=true", headers = headers, json = json).json()
links = [] #links are stored here
count = 0
#link_before and link_after help determine when to stop going to the next page
link_before = 0
link_after = 0
while True:
    if count > 0:
        #advance to the next page and fetch it (the first page was fetched above)
        json["scholarly_search"]["from"] += r_to_show
        req = requests.post("https://www.lens.org/lens/api/multi/search?request_cache=true", headers = headers, json = json).json()
    for x in req["query_result"]["hits"]["hits"]:
        links.append("https://www.lens.org/lens/scholar/article/{}/main".format(x["_source"]["record_lens_id"]))
    count += 1
    link_after = len(links)
    if link_after == link_before:
        break
    link_before = len(links)
    print(f"page {count} done, links recorded {len(links)}")
I added some comments to the code to make it easier to understand.
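The stop condition above (comparing the number of links before and after each page) can be tried in isolation without touching the network. In this sketch, fetch_page is a hypothetical stand-in for the POST request that returns canned record ids, two pages of three results and then nothing:

```python
def fetch_page(start):
    """Hypothetical stand-in for the search request: record ids per offset."""
    pages = {0: ["a", "b", "c"], 3: ["d", "e", "f"]}
    return pages.get(start, [])

links = []
start = 0
while True:
    before = len(links)
    for record_id in fetch_page(start):
        links.append(f"https://www.lens.org/lens/scholar/article/{record_id}/main")
    if len(links) == before:  # no new links means the last page was passed
        break
    start += 3
print(len(links))
# 6
```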