How do I get search results with BeautifulSoup?

Asked: 2019-12-20 21:59:37

Tags: python beautifulsoup

I'm still not very comfortable with BeautifulSoup (useful as it is). My question is: given a site like this

https://bigd.big.ac.cn/dogsdv2/pages/modules/indsnp/indsnp_search.jsp

how do I get the results of entering P2RY12 in the "Gene Name" input box?

More generally, how do I get search results from a website?

1 answer:

Answer 0 (score: 3)

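If you open the Firefox or Chrome developer tools, you can see where the page sends its requests. When you type P2RY12 into the search box and click the Submit button, the page sends a POST request to http://bigd.big.ac.cn/dogsdv2/indsnp/searchIndSNPSingle.action.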

In general, to get any information you need to know the URL and the parameters that are sent to it.
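
As a rough illustration of "the parameters sent to the URL": an HTML form like this one is normally submitted as an application/x-www-form-urlencoded body. The sketch below uses only the standard library and shows what that body looks like for three of the fields from this answer's example. `build_form_body` is a made-up helper name, not part of any library.

```python
from urllib.parse import urlencode


def build_form_body(fields):
    """Encode a dict of form fields as a form-urlencoded string."""
    return urlencode(fields)


# Three of the form fields from the example in this answer; the rest are omitted.
fields = {
    'searchForm.genename': 'P2RY12',
    'searchForm.limitFlag': 1,
    'searchForm.numlimit': 1000,
}

body = build_form_body(fields)
print(body)
# searchForm.genename=P2RY12&searchForm.limitFlag=1&searchForm.numlimit=1000
```

You rarely need to build this string by hand: passing a dict as `data=` to `requests.post` produces the same kind of encoding. It is mainly useful for comparing against the request body shown in the browser's network tab.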

This example pulls some information from the first page of results:

import requests
from bs4 import BeautifulSoup

url = 'http://bigd.big.ac.cn/dogsdv2/indsnp/searchIndSNPSingle.action'

# Form fields submitted by the search page; only the gene name is filled in.
data = {
    'totalCount': -1,
    'searchForm.chrom': 0,
    'searchForm.start': '',
    'searchForm.end': '',
    'searchForm.rsid': '',
    'searchForm.popu': 0,
    'searchForm.geneid': '',
    'searchForm.genename': 'P2RY12',
    'searchForm.goterm': '',
    'searchForm.gokeyword': '',
    'searchForm.limitFlag': 1,
    'searchForm.numlimit': 1000
}

headers = {
    'Referer': 'https://bigd.big.ac.cn/dogsdv2/pages/modules/indsnp/indsnp_search.jsp',
}

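# Submit the form as a POST request and stop early on an HTTP error.
response = requests.post(url, data=data, headers=headers)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')

# The third cell of each result row holds the SNP details.
for td in soup.select('table.table7 tr > td:nth-child(3)'):
    a = td.select_one('a')  # the first link in the cell is the SNP ID
    print('SNP ID:', a.get_text(strip=True))
    # The position is the text node right after the first <br>.
    # string=True is the current name for the older text=True argument.
    t1 = a.find_next_sibling('br').find_next_sibling(string=True)
    print('Position:', t1.strip())
    # The links that follow the position text are the gene IDs.
    print('Location:', ', '.join(link.get_text(strip=True) for link in t1.find_next_siblings('a')))
    # The genotype is the text node after the third <br>.
    print('Genotype:', a.find_next_siblings('br')[2].find_next_sibling(string=True).strip())
    print('-' * 80)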

Prints:

SNP ID: cfa19627795
Position: Chr23:45904511
Location: ENSCAFG00000008485, ENSCAFG00000008531, ENSCAFG00000008534
Genotype: G
--------------------------------------------------------------------------------
SNP ID: cfa19627797
Position: Chr23:45904579
Location: ENSCAFG00000008485, ENSCAFG00000008531, ENSCAFG00000008534
Genotype: C
--------------------------------------------------------------------------------
SNP ID: cfa19627803
Position: Chr23:45904842
Location: ENSCAFG00000008485, ENSCAFG00000008531, ENSCAFG00000008534
Genotype: C
--------------------------------------------------------------------------------

...and so on.
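
The sibling navigation in the loop is easier to follow against concrete markup. The sketch below runs the same kind of calls on a small hand-written snippet. The HTML is an assumption reconstructed from the selectors in the example, not copied from the site, and the IDs in it are made up. `parse_cell` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: inferred from the selectors, not the site's real HTML.
html = (
    '<table class="table7"><tr>'
    '<td>1</td><td>dog</td>'
    '<td><a href="#">snp1</a><br>Chr1:100 <a href="#">GENE_A</a>, '
    '<a href="#">GENE_B</a><br>intron<br>G</td>'
    '</tr></table>'
)


def parse_cell(td):
    """Extract the SNP fields from one table cell by walking its siblings."""
    a = td.select_one('a')  # first link: the SNP ID
    # Text node directly after the first <br>: the position.
    position = a.find_next_sibling('br').find_next_sibling(string=True)
    # Links that follow the position text: the gene IDs.
    genes = [g.get_text(strip=True) for g in position.find_next_siblings('a')]
    # Text node after the third <br>: the genotype.
    genotype = a.find_next_siblings('br')[2].find_next_sibling(string=True)
    return {
        'snp_id': a.get_text(strip=True),
        'position': position.strip(),
        'genes': genes,
        'genotype': genotype.strip(),
    }


soup = BeautifulSoup(html, 'html.parser')
rows = [parse_cell(td) for td in soup.select('table.table7 tr > td:nth-child(3)')]
print(rows)
```

Returning a dict per cell instead of printing makes it simple to collect the rows into a list or write them to a CSV file later. If the real page lays out its cells differently, the sibling indexes would need adjusting.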