urllib reads the wrong website

Time: 2017-04-23 18:24:34

Tags: python beautifulsoup web urllib

I want to extract some information from a website, but urllib does not retrieve the same content that the browser displays for that page.

import urllib.request
import bs4 as bs

generic_link_seq = "http://yeastmine.yeastgenome.org/yeastmine/sequenceExporter.do?object=1016810"

sauce = urllib.request.urlopen(generic_link_seq).read()
soup = bs.BeautifulSoup(sauce, "lxml")
text = soup.get_text().replace("\n", "")
print(text)

The actual content of the page begins with:

  

S000006360   atgaacagacaggaatccataaattcgtttaattcagacgaaacatcttcgttgtctgat   gtagaaagtcagcagccgcaacaatatatcccttcagagagtggatctaaatccaacatg   gctcctaatcaactgaagttgacccggacggaaaccgtgaagtcattgcaggac ...

The output Python gives me begins with:

  

YeastMine: HomejQuery && jQuery(function(){if(typeof intermine!=='undefined' && intermine.options){intermine.options.CDN.server = "http://yeastmine.yeastgenome.org/CDN/";}});Search and retrieve S. cerevisiae data with YeastMine, populated by SGD and powered by InterMine. Data updated on: Feb-6-2017HomeTemplatesListsQueryBuilderTools...

1 answer:

Answer 0: (score: 0)

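Try using requests.Session (you need to obtain some cookies first). The first request stores the site's cookies in the session, and the second request, sent with those cookies, returns the sequence.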

import requests
from bs4 import BeautifulSoup

generic_link_seq = "http://yeastmine.yeastgenome.org/yeastmine/sequenceExporter.do?object=1016810"
ses = requests.Session()
# First request: only needed so the session picks up the site's cookies.
ses.get(generic_link_seq)
# Second request: sent with those cookies, so it returns the sequence page.
sauce = ses.get(generic_link_seq).text
soup = BeautifulSoup(sauce, "lxml")
text = soup.get_text().replace("\n", "")
print(text)

Result:

  

> S000006360atgaacagacaggaatccataaattcgtttaattcagacgaaacatcttcgttgtctgat gtagaaagtcagcagccgcaacaatatatcccttcagagagtggatctaaatccaacatg gctcctaatcaactgaagttgacccggacggaaaccgtgaagtcattgcaggac ...
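
For anyone who would rather stay with urllib, the same idea can probably be expressed with the standard library's cookie support. The sketch below assumes, as this answer suggests, that the server only returns the sequence once cookies from an earlier response are sent back. The helper names `make_cookie_opener` and `fetch_after_priming` are made up for illustration and are not part of urllib.

```python
import http.cookiejar
import urllib.request


def make_cookie_opener():
    """Build a urllib opener that stores cookies and sends them back later."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch_after_priming(opener, url, timeout=30):
    """Request url twice and return the decoded body of the second response.

    The first request exists only to let the server set its cookies
    (an assumption about the site's behaviour, not something urllib requires).
    """
    with opener.open(url, timeout=timeout) as first:
        first.read()  # body is discarded; only the cookies matter here
    with opener.open(url, timeout=timeout) as second:
        charset = second.headers.get_content_charset() or "utf-8"
        return second.read().decode(charset, errors="replace")
```

To use it, create the opener once with `opener, jar = make_cookie_opener()`, then pass the text returned by `fetch_after_priming(opener, generic_link_seq)` to BeautifulSoup exactly as above. Printing `list(jar)` afterwards shows which cookies the server actually set, which is a quick way to check whether the cookie assumption holds.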