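I am trying to get data from an HTML URL that has a txt file embedded in it. I don't know how to handle this. I want to convert all of the lines into a CSV file.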
from bs4 import BeautifulSoup
import requests
url = "https://www.cftc.gov/dea/options/deaviewcit.htm"
# Download the page and parse its HTML
page = requests.get(url)
pagetext = page.text
soup = BeautifulSoup(pagetext, 'html.parser')
# Print the whole parsed document
print(soup)
This gives the following output:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en-US" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>CFTC Commitments of Traders Supplemental - CIT (Combined)</title>
<!--begin meta-->
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="no-cache" http-equiv="Pragma">
<meta content="0" http-equiv="Expires">
<meta content="This is the viewable version of the most recent release of the CIT Supplemental commitments report." name="description"/>
<meta content="deaviewcit.htm" name="file"/>
<meta content="commitments, long, CFTC, CIT, Supplemental" name="keywords"/>
<meta content="dea" name="office"/>
<!--end meta-->
<script src="http://www.google-analytics.com/ga.js" type="text/javascript"></script><script type="text/javascript">var pageTracker = _gat._getTracker("UA-21047137-1"); pageTracker._trackPageview();</script><script src="/ucm/fragments/web_header/js/gaAddons.js" type="text/javascript"></script></meta></meta></head>
<body>
<!--begin content-->
<pre>
<!--ih:includeHTML file="deaviewcit.txt"-->
COT -- Supplemental Report - Option and Futures Combined Positions as of August 13, 2019
: Reportable Positions : Nonreportable
:---------------------------------------------------------------------------------------- : Positions
: Non-Commercial : Commercial : Index Traders : Total
: Long : Short :Spreading: Long : Short : Long : Short : Long : Short : Long : Short
-------------------------------------------------------------------------------------------------------------------
WHEAT-SRW - CHICAGO BOARD OF TRADE
(I copied only the header portion.)
I would like to save every line of the embedded txt file to a CSV file.
Thanks.
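One possible direction, as a minimal sketch rather than a definitive solution: in the output above, the report text sits inside the page's `<pre>` element, so it may be enough to take that element's text, split it into lines, and write one CSV row per line. The helper names `fetch_pre_text`, `text_to_rows`, and `write_rows`, the output filename, and the choice of `:` as the column separator are all assumptions made for illustration. The real data rows may need fixed-width parsing instead.

```python
import csv


def fetch_pre_text(url):
    """Download the page and return the text inside its first <pre> block."""
    # Imported here so the parsing helpers below work without these packages.
    import requests
    from bs4 import BeautifulSoup

    page = requests.get(url, timeout=30)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")
    pre = soup.find("pre")
    if pre is None:
        raise ValueError("no <pre> block found in the page")
    return pre.get_text()


def text_to_rows(text, delimiter=":"):
    """Return one row per non-blank line, split on `delimiter` (an assumption)."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        rows.append([cell.strip() for cell in line.split(delimiter)])
    return rows


def write_rows(rows, path):
    """Write the rows to a CSV file; rows may have different lengths."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


# Hypothetical usage (needs network access, requests, and bs4):
# text = fetch_pre_text("https://www.cftc.gov/dea/options/deaviewcit.htm")
# write_rows(text_to_rows(text), "deaviewcit.csv")
```

Lines without a `:` (such as the report title or a market name) end up as single-cell rows, and header lines that start with `:` get an empty first cell.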