Question

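I'm trying to get data from an HTML URL that has a txt file embedded in it. I don't know how to handle this. I want to convert all of the lines into a CSV file.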

from bs4 import BeautifulSoup
import requests

url = "https://www.cftc.gov/dea/options/deaviewcit.htm"
page = requests.get(url)
pagetext = page.text

soup = BeautifulSoup(pagetext, 'html.parser')

print(soup)

This gives the following output:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html lang="en-US" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>CFTC Commitments of Traders Supplemental - CIT (Combined)</title>
<!--begin meta-->
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="no-cache" http-equiv="Pragma">
<meta content="0" http-equiv="Expires">
<meta content="This is the viewable version of the most recent release of the CIT Supplemental commitments report." name="description"/>
<meta content="deaviewcit.htm" name="file"/>
<meta content="commitments, long, CFTC, CIT, Supplemental" name="keywords"/>
<meta content="dea" name="office"/>
<!--end meta-->
<script src="http://www.google-analytics.com/ga.js" type="text/javascript"></script><script type="text/javascript">var pageTracker = _gat._getTracker("UA-21047137-1"); pageTracker._trackPageview();</script><script src="/ucm/fragments/web_header/js/gaAddons.js" type="text/javascript"></script></meta></meta></head>
<body>
<!--begin content-->
<pre>

<!--ih:includeHTML file="deaviewcit.txt"-->
COT -- Supplemental Report - Option and Futures Combined Positions as of August 13, 2019                 
    :                                    Reportable Positions                                 :    Nonreportable
    :---------------------------------------------------------------------------------------- :      Positions
    :         Non-Commercial      :      Commercial   :     Index Traders :        Total
    :    Long :   Short :Spreading:    Long :   Short :    Long :   Short :    Long :   Short :    Long :   Short
-------------------------------------------------------------------------------------------------------------------
WHEAT-SRW - CHICAGO BOARD OF TRADE

(I only copied the header portion)
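Judging from the output above, the embedded text appears to sit inside the single `<pre>` element, right after an HTML comment. A minimal sketch of pulling that text out is below. It uses only the standard library's `html.parser` and a shortened, made-up `SAMPLE_HTML` string so that it runs without network access; the names `PreTextExtractor` and `extract_pre_text` are hypothetical helpers, not part of any library. With the `soup` object from the code above, `soup.find("pre").get_text()` should give similar text.

```python
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Collect the text found inside <pre> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._depth = 0      # how many <pre> elements are currently open
        self.chunks = []     # text fragments seen inside <pre>

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        # HTML comments are routed to handle_comment, which is not
        # overridden here, so the includeHTML comment is skipped.
        if self._depth:
            self.chunks.append(data)


def extract_pre_text(html):
    """Return all text inside <pre>...</pre> as one string."""
    parser = PreTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# Shortened stand-in for the real page (assumption: same overall shape).
SAMPLE_HTML = """<html><body>
<pre>

<!--ih:includeHTML file="deaviewcit.txt"-->
COT -- Supplemental Report
WHEAT-SRW - CHICAGO BOARD OF TRADE
</pre>
</body></html>"""

text = extract_pre_text(SAMPLE_HTML)
# Keep non-empty lines only, dropping trailing padding.
lines = [line.rstrip() for line in text.splitlines() if line.strip()]
print(lines)
```

For the live page, `page.text` from the `requests.get(url)` call could be passed to `extract_pre_text` in place of `SAMPLE_HTML`.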

I'd like to save every line of the embedded txt file to a CSV file.

Thanks.
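One possible way to do the saving step is sketched below with the standard `csv` module. It assumes the report lines are already available as a list of strings, and it writes each non-empty line as a single-column row, since the question does not show the layout of the data rows and splitting them into columns would be guesswork. The helper name `lines_to_csv` and the sample lines are illustrative only.

```python
import csv
import io


def lines_to_csv(lines, fileobj):
    """Write one CSV row per non-empty line; return the number of rows written."""
    writer = csv.writer(fileobj)
    count = 0
    for line in lines:
        stripped = line.rstrip()          # drop trailing padding
        if stripped.strip():              # skip blank lines
            writer.writerow([stripped])   # whole line goes in one column
            count += 1
    return count


# Two lines taken from the header shown in the question, plus a blank one.
sample_lines = [
    "COT -- Supplemental Report - Option and Futures Combined Positions as of August 13, 2019",
    "",
    "WHEAT-SRW - CHICAGO BOARD OF TRADE",
]

# An in-memory buffer stands in for a real file so the sketch has no side effects.
buffer = io.StringIO()
written = lines_to_csv(sample_lines, buffer)
print(written)
print(buffer.getvalue())
```

To write to disk instead, a file opened with `open("deaviewcit.csv", "w", newline="", encoding="utf-8")` could be passed as `fileobj` (the file name is just an example). Lines that contain a comma, such as the date in the first line, are quoted by `csv.writer` so they stay in one field.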

Extract a txt file embedded in HTML code and convert it to CSV

0 answers: