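I'm still not very comfortable with BeautifulSoup (useful as it is). My question: given a site like this
https://bigd.big.ac.cn/dogsdv2/pages/modules/indsnp/indsnp_search.jsp
how do I get the results of entering P2RY12 in the "Gene Name" input box?
More generally, how do I fetch the search results from a website?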
Answer 0 (score: 3)
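If you open the Firefox/Chrome developer tools, you can see where the page sends its requests. When you enter P2RY12 in the search box
and click the Submit button, the page sends a POST request to http://bigd.big.ac.cn/dogsdv2/indsnp/searchIndSNPSingle.action
with the form fields as its body.
In general, to get any information this way you need to know the URL and the parameters sent to that URL.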
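When the search is an ordinary HTML form, the URL and parameter names can often be read from the form markup itself. The following is a minimal sketch of that idea, not a description of the site above: `SAMPLE_FORM`, `describe_form`, the field names, and the `example.com` address are all hypothetical, and pages that submit through JavaScript still need the developer-tools check.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical form markup, used only to illustrate the idea.
SAMPLE_FORM = """
<form action="/search.action" method="post">
  <input type="text" name="q" value="">
  <input type="hidden" name="limit" value="1000">
  <select name="chrom">
    <option value="0" selected="selected">All</option>
    <option value="1">1</option>
  </select>
  <input type="submit" value="Submit">
</form>
"""


def describe_form(html, page_url):
    """Return the target URL, HTTP method and default fields of the first form."""
    soup = BeautifulSoup(html, 'html.parser')
    form = soup.find('form')
    if form is None:
        return None

    fields = {}
    for tag in form.find_all(['input', 'select']):
        name = tag.get('name')
        if not name:  # unnamed controls (e.g. this submit button) are not sent
            continue
        if tag.name == 'select':
            # Use the pre-selected option, or fall back to the first one.
            option = tag.find('option', selected=True) or tag.find('option')
            fields[name] = option.get('value', '') if option else ''
        else:
            fields[name] = tag.get('value', '')

    return {
        # The action may be relative, so resolve it against the page URL.
        'url': urljoin(page_url, form.get('action', '')),
        'method': form.get('method', 'get').lower(),
        'fields': fields,
    }


info = describe_form(SAMPLE_FORM, 'https://example.com/pages/search.jsp')
print(info)
```

The returned `fields` dictionary can be used as a starting point for the `data` argument of `requests.post`, after filling in the value you want to search for.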
This example pulls some information from the first page of results:
import requests
from bs4 import BeautifulSoup

url = 'http://bigd.big.ac.cn/dogsdv2/indsnp/searchIndSNPSingle.action'

# Form fields sent by the search page; only the gene name is filled in.
data = {
    'totalCount': -1,
    'searchForm.chrom': 0,
    'searchForm.start': '',
    'searchForm.end': '',
    'searchForm.rsid': '',
    'searchForm.popu': 0,
    'searchForm.geneid': '',
    'searchForm.genename': 'P2RY12',
    'searchForm.goterm': '',
    'searchForm.gokeyword': '',
    'searchForm.limitFlag': 1,
    'searchForm.numlimit': 1000
}

# The Referer header points back at the search form page.
headers = {
    'Referer': 'https://bigd.big.ac.cn/dogsdv2/pages/modules/indsnp/indsnp_search.jsp',
}

response = requests.post(url, data=data, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# The third cell of each result row holds the SNP details.
for td in soup.select('table.table7 tr > td:nth-child(3)'):
    a = td.select_one('a')
    print('SNP ID:', a.get_text(strip=True))

    # The text node right after the first <br> is the position.
    t1 = a.find_next_sibling('br').find_next_sibling(string=True)
    print('Position:', t1.strip())

    # The links that follow the position text are the gene locations.
    print('Location:', ', '.join(link.get_text(strip=True) for link in t1.find_next_siblings('a')))

    # The text node after the third <br> is the genotype.
    print('Genotype:', a.find_next_siblings('br')[2].find_next_sibling(string=True).strip())
    print('-' * 80)
Prints:
SNP ID: cfa19627795
Position: Chr23:45904511
Location: ENSCAFG00000008485, ENSCAFG00000008531, ENSCAFG00000008534
Genotype: G
--------------------------------------------------------------------------------
SNP ID: cfa19627797
Position: Chr23:45904579
Location: ENSCAFG00000008485, ENSCAFG00000008531, ENSCAFG00000008534
Genotype: C
--------------------------------------------------------------------------------
SNP ID: cfa19627803
Position: Chr23:45904842
Location: ENSCAFG00000008485, ENSCAFG00000008531, ENSCAFG00000008534
Genotype: C
--------------------------------------------------------------------------------
...and so on.
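
If you want the results as data rather than printed text, the same sibling navigation can be wrapped in a function that returns a list of dictionaries. This is a sketch under assumptions: `SAMPLE_HTML` is hypothetical markup reduced to the structure the selectors above rely on (the real page may differ), and `parse_snps` is an illustrative name. It runs without network access, which makes the parsing logic easy to check.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the parts the selectors depend on.
SAMPLE_HTML = """
<table class="table7">
  <tr>
    <td>1</td>
    <td>P2RY12</td>
    <td><a href="#">cfa19627795</a><br>Chr23:45904511
        <a href="#">ENSCAFG00000008485</a>
        <a href="#">ENSCAFG00000008531</a><br>placeholder<br>G</td>
  </tr>
</table>
"""


def parse_snps(html):
    """Collect SNP ID, position, locations and genotype from each result row."""
    soup = BeautifulSoup(html, 'html.parser')
    records = []
    for td in soup.select('table.table7 tr > td:nth-child(3)'):
        a = td.select_one('a')
        if a is None:  # skip cells without a SNP link
            continue
        breaks = a.find_next_siblings('br')
        if len(breaks) < 3:  # layout differs from what this sketch assumes
            continue
        # Text after the first <br> is the position, after the third the genotype.
        position = breaks[0].find_next_sibling(string=True)
        genotype = breaks[2].find_next_sibling(string=True)
        links = position.find_next_siblings('a') if position else []
        records.append({
            'snp_id': a.get_text(strip=True),
            'position': position.strip() if position else '',
            'locations': [link.get_text(strip=True) for link in links],
            'genotype': genotype.strip() if genotype else '',
        })
    return records


for record in parse_snps(SAMPLE_HTML):
    print(record)
```

To use it on the live response, pass the response text from `requests.post` to `parse_snps` instead of `SAMPLE_HTML`. Rows that do not match the assumed layout are skipped rather than raising an exception.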