I am trying to scrape this page.
I wrote this code:
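import pandas as pd

output_file = open('neuropep.txt', 'a')
# range(1, 2) yields only i = 1, i.e. the single entry NP00001
for i in range(1, 2):
    number = '{:05}'.format(i)  # zero-pad to five digits, e.g. 00001
    url = 'http://isyslab.info/NeuroPep/search_info?pepNum=NP' + number
    tables = pd.read_html(url)  # one DataFrame per <table> found on the page
    print(tables[0][1])  # column 1 (the values) of the first table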
The output is:
0 NP00001
1 7B2 C-terminal peptide (5-13)
2 Rattus norvegicus
3 10116
4 NaN
5 7B2
6 NaN
7 9
8 NaN
9 NaN
10 FSEEEKEPE
11 View
12 NaN
13 NaN
Name: 1, dtype: object
But from the link I can see that row 13 should read:
Karlsson O, Kultima K, Wadensten H, Nilsson A, Roman E, Andrén PE, Brittebo EB Neurotoxin-induced neuropeptide perturbations in striatum of neonatal rats J Proteome Res 2013 Apr 5;12(4):1678-90
PMID: 23410195
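I can't work out where this discrepancy comes from. I tried poking at different parts of the table, but I'm not sure how to find out where the missing data ends up. I don't actually need the whole reference, only the PubMed ID.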
Edit 1: I tried BeautifulSoup:
import requests
from bs4 import BeautifulSoup

for i in range(1, 2):
    number = '{:05}'.format(i)
    url = 'http://isyslab.info/NeuroPep/search_info?pepNum=NP' + number
    res = requests.get(url)
    soup = BeautifulSoup(res.content, 'lxml')
    # collect every <li> element on the page
    table = soup.find_all('li')
    print(table)
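
To make the goal concrete, here is a minimal sketch of the extraction step I am after. It assumes the PubMed ID shows up somewhere in the text I manage to retrieve in the form "PMID: <digits>", as it does in the reference quoted above. I have not confirmed that the HTML returned by the server actually contains it. The helper name extract_pmids is hypothetical, and the sample string is copied from the reference above rather than fetched from the site.

```python
import re

def extract_pmids(text):
    """Return every PubMed ID that appears in text as 'PMID: <digits>'."""
    # \s* tolerates any amount of whitespace between the colon and the number
    return re.findall(r'PMID:\s*(\d+)', text)

# Sample text taken from the reference shown on the page (not fetched live)
sample = 'J Proteome Res 2013 Apr 5;12(4):1678-90\nPMID: 23410195'
print(extract_pmids(sample))
```

If the text of the reference cell could be obtained (for example from res.text in the BeautifulSoup attempt), the same function could be applied to it directly.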