Unable to read HTML data - Python

Time: 2011-06-28 19:46:17

Tags: python html urllib2 mechanize

I am trying to parse HTML data from a website using BeautifulSoup for Python. However, neither urllib2 nor mechanize reads the whole HTML page. The data returned is:

<html>
<head>
    <title>
    EC 4.1.2.13 - Fructose-bisphosphate aldolase    </title>
    <meta name="description" content="Information on EC 4.1.2.13 - Fructose-bisphosphate aldolase">
    <meta name="keywords" content="EC,Number,Enzyme,Pathway,Reaction,Organism,Substrate,Cofactor,Inhibitor,Compound,KM Value,KI Value,IC50 Value,pi Value,Turnover Number,pH,Temperature,Optimum,Range,Source Tissue,BLAST,Subunits,Modification,Crystallization,Stability,Purification">
</head>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" "http://www.w3.org/TR/html4/frameset.dtd">
<frameset cols="190,*" border="0">
    <frame name="navigation" src="flat_navigation.php4?ecno=4.1.2.13&organism_list=Mycobacterium tuberculosis&Suchword=&UniProtAcc=P67475" frameborder="no">
    <frameset rows="110,*" border="0">
            <frame name="header" src="flat_head.php4?ecno=4.1.2.13" frameborder="no">

        <frame name="flat" src="flat_result.php4?ecno=4.1.2.13&organism_list=Mycobacterium tuberculosis&Suchword=&UniProtAcc=P67475" frameborder="no">

    </frameset>
</frameset>
<noframes>
<body>
<h1>EC 4.1.2.13 - Fructose-bisphosphate aldolase </h1>

<a href="flat_result.php4?ecno=4.1.2.13&organism_list=Mycobacterium tuberculosis&Suchword=&UniProtAcc=P67475">More detailed information on the enzyme EC 4.1.2.13 - Fructose-bisphosphate aldolase</a>

Sorry, but your browser doesn't support frames. Please use another browser!
</body>
</noframes>
</html>

When I open the website manually in Internet Explorer, the whole HTML can be read. Is there any way to solve this using urllib2, mechanize, or BeautifulSoup?

1 answer:

Answer 0 (score: 3)

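That is because the content is in frames. What urllib2 and mechanize return is only the frameset page; a browser then loads each frame as a separate request. You can either parse the page and look for the src attribute of the main <frame> element, or request that frame directly. In the data above, the main frame appears to be the one named "flat", whose src points to flat_result.php4. In most browsers, you can right-click and choose "Frame properties" to get the frame's URL.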
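The first option, parsing the frameset page for the frame src attributes, could look roughly like the sketch below. It uses only the Python 3 standard library (html.parser and urllib.parse) instead of the urllib2 and BeautifulSoup setup from the question. The names FrameCollector and frame_urls are hypothetical, and the base URL http://example.org/php/result.php4 is a placeholder, since the real page address is not given in the question.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameCollector(HTMLParser):
    """Collect the name and src of every <frame> tag in a document."""

    def __init__(self):
        super().__init__()
        self.frames = {}  # frame name -> raw src attribute

    def handle_starttag(self, tag, attrs):
        if tag == "frame":
            attrs = dict(attrs)
            if attrs.get("src"):
                self.frames[attrs.get("name")] = attrs["src"]


def frame_urls(html, base_url):
    """Return {frame name: absolute URL} for the frames in `html`."""
    parser = FrameCollector()
    parser.feed(html)
    return {
        # The src values are relative, so resolve them against the page URL.
        # Spaces (as in "Mycobacterium tuberculosis") are percent-encoded
        # so the result can be passed to an HTTP client.
        name: urljoin(base_url, src.replace(" ", "%20"))
        for name, src in parser.frames.items()
    }


# Shortened copy of the frameset from the question.
sample = """
<frameset cols="190,*" border="0">
    <frame name="navigation" src="flat_navigation.php4?ecno=4.1.2.13" frameborder="no">
    <frame name="flat" src="flat_result.php4?ecno=4.1.2.13&organism_list=Mycobacterium tuberculosis" frameborder="no">
</frameset>
"""

# Placeholder address (assumption): substitute the URL you actually requested.
urls = frame_urls(sample, "http://example.org/php/result.php4")
print(urls["flat"])
```

The URL for the "flat" frame can then be fetched the same way as the original page, for example with urllib2.urlopen in Python 2 or urllib.request.urlopen in Python 3, and the response passed to BeautifulSoup.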