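When I view this link https://www.cftc.gov/sites/default/files/files/dea/cotarchives/2015/futures/financial_lf061615.htm the text is displayed in a clean, readable layout. However, when I try to parse the page with Beautiful Soup, the output does not look the same; it is all jumbled. Here is the code: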
import urllib.request
from bs4 import BeautifulSoup
request = urllib.request.Request('https://www.cftc.gov/sites/default/files/files/dea/cotarchives/2015/futures/financial_lf061615.htm')
htm = urllib.request.urlopen(request).read()
soup = BeautifulSoup(htm,'html.parser')
text = soup.get_text()
print(text)
The desired output looks like this:
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Traders in Financial Futures - Futures Only Positions as of June 16, 2015
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Dealer : Asset Manager/ : Leveraged : Other : Nonreportable :
Intermediary : Institutional : Funds : Reportables : Positions :
Long : Short : Spreading: Long : Short : Spreading: Long : Short : Spreading: Long : Short : Spreading: Long : Short :
-----------------------------------------------------------------------------------------------------------------------------------------------------------
DOW JONES UBS EXCESS RETURN - CHICAGO BOARD OF TRADE ($100 X INDEX)
CFTC Code #221602 Open Interest is 19,721
Positions
97 2,934 0 8,941 1,574 973 6,490 11,975 1,694 1,372 539 0 154 32
Changes from: June 9, 2015 Total Change is: 3,505
48 0 0 2,013 1,141 70 447 1,369 923 -64 0 0 68 2
Percent of Open Interest Represented by Each Category of Trader
0.5 14.9 0.0 45.3 8.0 4.9 32.9 60.7 8.6 7.0 2.7 0.0 0.8 0.2
Number of Traders in Each Category Total Traders: 31
. . 0 5 . . 6 9 . 5 . 0
-----------------------------------------------------------------------------------------------------------------------------------------------------------
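After looking at the page source, it is not clear to me how the line breaks in that layout are produced, and I think that is where the problem lies.
Is there some kind of structure I need to specify in the BeautifulSoup function? I am quite lost here, so any help is appreciated.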
Previously I tried to install the html2text module, with no luck, using !conda config --append channels conda-forge
and !conda install html2text
Cheers
Edit: I've figured it out. I'm a brainlet.
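request = urllib.request.Request('https://www.cftc.gov/sites/default/files/files/dea/cotarchives/2015/futures/financial_lf061615.htm')
htm = urllib.request.urlopen(request).read()
# Decode the raw bytes before doing any string operations on them.
htm = htm.decode('windows-1252')
# Drop the existing line breaks, then split where one <pre> block ends and the next begins.
htm = htm.replace('\n', '').replace('\r', '')
htm = htm.split('</pre><pre>')
# Strip the remaining tags from each chunk.
cleaned = []
for i in htm:
    i = BeautifulSoup(i, 'html.parser').get_text()
    cleaned.append(i)
# Write one chunk per line.
with open('trouble.txt', 'w') as f:
    for line in cleaned:
        f.write('%s\n' % line)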
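
For anyone who wants to try the same idea without hitting the network, here is a minimal sketch. It swaps Beautiful Soup for the standard library's html.parser, and it assumes (as the split on '</pre><pre>' above suggests) that each output line sits in its own <pre> element. The names PreTextCollector and pre_lines and the sample bytes are made up for illustration; the sample is not the real page content.

```python
from html.parser import HTMLParser


class PreTextCollector(HTMLParser):
    """Collect the text of each <pre> element as one output line."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []      # finished text, one entry per <pre> element
        self._parts = None   # pieces of the <pre> currently open, else None

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "pre" and self._parts is not None:
            text = "".join(self._parts)
            # Mirror the post: drop line breaks found inside a block.
            self.lines.append(text.replace("\r", "").replace("\n", ""))
            self._parts = None

    def handle_data(self, data):
        # Only keep text that appears inside a <pre> element.
        if self._parts is not None:
            self._parts.append(data)


def pre_lines(raw: bytes, encoding: str = "windows-1252") -> list:
    """Decode the page bytes and return one string per <pre> element."""
    parser = PreTextCollector()
    parser.feed(raw.decode(encoding))
    parser.close()
    return parser.lines


# Hypothetical sample standing in for the downloaded page bytes.
sample = (
    b"<html><body>"
    b"<pre>CFTC Code #221602   Open Interest is  19,721</pre>"
    b"<pre>Positions</pre>"
    b"<pre>  97   2,934   0</pre>"
    b"</body></html>"
)

lines = pre_lines(sample)
print("\n".join(lines))
```

With the real page, the bytes returned by urllib.request.urlopen(request).read() would take the place of sample. Whether the result matches the browser layout depends on the page really using one <pre> per line, which is an assumption here.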