I want to extract some information from a website, but urllib does not retrieve the same content that the browser displays for that page.
import urllib.request
import bs4 as bs

generic_link_seq = "http://yeastmine.yeastgenome.org/yeastmine/sequenceExporter.do?object=1016810"
sauce = urllib.request.urlopen(generic_link_seq).read()
soup = bs.BeautifulSoup(sauce, "lxml")
text = soup.get_text().replace("\n", "")
print(text)
The actual content of the page starts with:
S000006360 atgaacagacaggaatccataaattcgtttaattcagacgaaacatcttcgttgtctgat gtagaaagtcagcagccgcaacaatatatcccttcagagagtggatctaaatccaacatg gctcctaatcaactgaagttgacccggacggaaaccgtgaagtcattgcaggac ...
The output Python gives me starts with:
YeastMine: HomejQuery && jQuery(function(){ if (typeof intermine !== 'undefined' && intermine.options) { intermine.options.CDN.server = "http://yeastmine.yeastgenome.org/CDN/"; } });Search and retrieve S. cerevisiae data with YeastMine, populated by SGD and powered by InterMine. Data Updated on: Feb-6-2017 Home Templates Lists QueryBuilder Tools ...
Answer 0 (score: 0)
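Try using requests.Session
(you need to obtain some cookies first: request the page once so the session picks them up, then request it again with the same session)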
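import requests
from bs4 import BeautifulSoup

generic_link_seq = "http://yeastmine.yeastgenome.org/yeastmine/sequenceExporter.do?object=1016810"
ses = requests.Session()
# First request: the response is discarded; it only lets the server set its cookies on the session.
ses.get(generic_link_seq)
# Second request: sent with those cookies, it returns the sequence export.
sauce = ses.get(generic_link_seq).text
soup = BeautifulSoup(sauce, "lxml")
text = soup.get_text().replace("\n", "")
print(text)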
Result:
> S000006360atgaacagacaggaatccataaattcgtttaattcagacgaaacatcttcgttgtctgat gtagaaagtcagcagccgcaacaatatatcccttcagagagtggatctaaatccaacatg gctcctaatcaactgaagttgacccggacggaaaccgtgaagtcattgcaggac ...
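
If you would rather stay with urllib, as in the question, the same "get the cookies first, then ask again" idea can be sketched with a cookie-aware opener from the standard library. This is only a sketch and I have not verified it against YeastMine: the helper names `build_cookie_opener` and `fetch_after_priming` are made up for illustration, decoding the body as UTF-8 is an assumption, and it also assumes the server is satisfied by cookies alone.

```python
import urllib.request
from http.cookiejar import CookieJar


def build_cookie_opener():
    """Return (opener, jar); the opener stores received cookies in jar and resends them."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch_after_priming(opener, url, timeout=30):
    """Open url twice with the same opener and return the second body as text."""
    # First request: the body is read and discarded; it only lets the server set cookies.
    with opener.open(url, timeout=timeout) as first:
        first.read()
    # Second request: the opener sends back whatever cookies the jar now holds.
    with opener.open(url, timeout=timeout) as second:
        # Assumption: the body is UTF-8 (undecodable bytes are replaced).
        return second.read().decode("utf-8", errors="replace")


# Usage (performs real network requests):
# opener, jar = build_cookie_opener()
# text = fetch_after_priming(opener, generic_link_seq)
# print([cookie.name for cookie in jar])  # cookie names are whatever the server sets
```

The returned text can be passed to BeautifulSoup exactly like `sauce` above. Printing the cookie names from `jar` is a quick way to check whether the first request actually set anything.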