HTML parsing with Beautiful Soup gives a different structure than the website

Time: 2018-07-28 06:33:35

Tags: python html python-3.x web-scraping beautifulsoup

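When I view this link, https://www.cftc.gov/sites/default/files/files/dea/cotarchives/2015/futures/financial_lf061615.htm, the text is displayed in a clean layout. However, when I try to parse the page with Beautiful Soup, the output doesn't look the same; it is all jumbled. Here is the code: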

import urllib.request
from bs4 import BeautifulSoup

request = urllib.request.Request('https://www.cftc.gov/sites/default/files/files/dea/cotarchives/2015/futures/financial_lf061615.htm')
htm = urllib.request.urlopen(request).read()
soup = BeautifulSoup(htm,'html.parser')
text = soup.get_text()
print(text)

The desired output looks like this:

-----------------------------------------------------------------------------------------------------------------------------------------------------------
Traders in Financial Futures - Futures Only Positions as of June 16, 2015                   
-----------------------------------------------------------------------------------------------------------------------------------------------------------
              Dealer            :           Asset Manager/       :            Leveraged           :              Other             :     Nonreportable    :
           Intermediary         :           Institutional        :              Funds             :           Reportables          :       Positions      :
    Long  :   Short  : Spreading:    Long  :   Short  : Spreading:    Long  :   Short  : Spreading:    Long  :   Short  : Spreading:    Long  :   Short   :
-----------------------------------------------------------------------------------------------------------------------------------------------------------
DOW JONES UBS EXCESS RETURN - CHICAGO BOARD OF TRADE   ($100 X INDEX)                               
CFTC Code #221602                                                    Open Interest is    19,721
Positions
        97      2,934          0      8,941      1,574        973      6,490     11,975      1,694      1,372        539          0        154         32

Changes from:       June 9, 2015                                     Total Change is:     3,505
        48          0          0      2,013      1,141         70        447      1,369        923        -64          0          0         68          2

Percent of Open Interest Represented by Each Category of Trader
       0.5       14.9        0.0       45.3        8.0        4.9       32.9       60.7        8.6        7.0        2.7        0.0        0.8        0.2

Number of Traders in Each Category                                    Total Traders:        31 
         .          .          0          5          .          .          6          9          .          5          .          0
-----------------------------------------------------------------------------------------------------------------------------------------------------------

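After looking at the page source, it isn't clear to me how the markup produces the new lines, and I think that is where the problem comes from.

Is there some type of structure I need to specify in the BeautifulSoup function? I'm quite lost here, so any help is appreciated.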
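
For illustration, here is a minimal sketch of how get_text() treats adjacent elements. The two-line `sample` string is a made-up fragment, not the real CFTC markup, and it assumes each line of the report sits in its own <pre> element; the actual page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two adjacent <pre> elements, one per report line.
sample = "<pre>Dealer  :  Asset Manager</pre><pre>Long  :  Short</pre>"

soup = BeautifulSoup(sample, "html.parser")

# Default behaviour: text nodes are concatenated with nothing between them.
joined = soup.get_text()

# With a separator, each text node ends up on its own line.
separated = soup.get_text(separator="\n")

print(repr(joined))
print(repr(separated))
```

Note that the separator is inserted between every text node, so a <pre> that contains nested tags would also be broken across lines.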

I previously tried to install the html2text module on Anaconda, with no luck, using

!conda config --append channels conda-forge
!conda install html2text

Cheers

Edit: I've figured it out. I'm a brainlet.

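import urllib.request
from bs4 import BeautifulSoup

request = urllib.request.Request('https://www.cftc.gov/sites/default/files/files/dea/cotarchives/2015/futures/financial_lf061615.htm')
htm = urllib.request.urlopen(request).read()
# Decode the raw bytes before doing any string manipulation
htm = htm.decode('windows-1252')
# Drop the line breaks in the source, then split where one <pre> block ends and the next begins
htm = htm.replace('\n','').replace('\r','')
htm = htm.split('</pre><pre>')

cleaned = []
for chunk in htm:
    # Strip any remaining tags from each chunk
    cleaned.append(BeautifulSoup(chunk, 'html.parser').get_text())

# Write one chunk per line
with open('trouble.txt','w') as f:
    for line in cleaned:
        f.write('%s\n' % line)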
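
A possible variation, sketched under the same assumption that every report line lives in its own <pre> element: let Beautiful Soup find the <pre> elements instead of splitting the string by hand. The helper name `pre_lines` and the `sample` bytes are hypothetical, and this has not been checked against the live page.

```python
from bs4 import BeautifulSoup


def pre_lines(raw, encoding="windows-1252"):
    """Return the text of each <pre> element, one list entry per element."""
    soup = BeautifulSoup(raw.decode(encoding), "html.parser")
    # Drop line breaks inside each element, mirroring the replace() calls above.
    return [
        pre.get_text().replace("\r", "").replace("\n", "")
        for pre in soup.find_all("pre")
    ]


# Made-up stand-in for the downloaded bytes.
sample = b"<html><body><pre>Positions\r\n</pre><pre>   97   2,934</pre></body></html>"
lines = pre_lines(sample)
print("\n".join(lines))
```

Compared with splitting on '</pre><pre>', this does not depend on the closing and opening tags being exactly adjacent in the source.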

0 answers:

No answers