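I am writing Python 3 code to parse an HTML/CSS table. I have a few problems:
As a newbie, I'm sure my code is very crude and that there are better ways to do this. I'd like to learn to write it better and to fix the problems above.
I'd appreciate any advice and guidance. Thanks.
My code: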
import csv
import requests
from bs4 import BeautifulSoup

## SIDs.csv contains ids in col2 based on which the 'url' variable pulls the respective data
SIDFile = open('SIDs.csv')
SIDReader = csv.reader(SIDFile)
SID = list(SIDReader)
SqID_data = []

#create and open output file
with open('output.csv','a', newline='') as csv_h:
    fields = \
    [
        "ID",
        "Financial Year",
        "Total Income",
        "Total Expenses",
        "Tax Expense",
        "Net Profit"
    ]
    for row in SID:
        col1,col2 = row
        SID ="%s" % (col2)
        url = requests.get("http://.......")
        soup = BeautifulSoup(url.text, "lxml")
        fy = soup.findAll('td',{'class':'tablehead'})
        titles = soup.findAll('td',{'class':'t1'})
        values = soup.findAll('td',{'class':'t0'})
        if titles:
            data = {}
            for title in titles:
                name = title.find("td", class_ = "t1")
                data["ID"] = SID
                data["Financial Year"] = fy[0].string.strip()
                data["Total Income"] = values[0].string.strip()
                data["Total Expenses"] = values[1].string.strip()
                data["Tax Expense"] = values[2].string.strip()
                data["Net Profit"] = values[3].string.strip()
            SqID_data.append(data)
    #Prepare CSV writer.
    writer = csv.DictWriter\
    (
        csv_h,
        fields,
        quoting = csv.QUOTE_ALL,
        extrasaction = "ignore",
        dialect = "excel",
        lineterminator = "\n",
    )
    writer.writeheader()
    writer.writerows(SqID_data)
    print("write rows complete")
Excerpt of the HTML being processed:
<p>
<TABLE border=0 cellspacing=1 cellpadding=6 align=center class="vTable">
<TR>
<TD class=tablehead>Financial Year</TD>
<TD class=t1>01-Apr-2015 To 31-Mar-2016</TD>
</TR>
</TABLE>
</p>
<p>
<br>
<table cellpadding=3 cellspacing=1 class=vTable>
<TR>
<TD class=t1><b>Total income from operations (net) ( a + b)</b></td>
<TD class=t0 nowrap>675529.00</td>
</tr>
<TR>
<TD class=t1><b>Total expenses</b></td>
<TD class=t0 nowrap>446577.00</td>
</tr>
<TR>
<TD class=t1>Tax expense</td>
<TD class=t0 nowrap>71708.00</td>
</tr>
<TR>
<TD class=t1><b>Net Profit / (Loss)</b></td>
<TD class=t0 nowrap>157621</td>
</tr>
</table>
</p>
SIDs.csv (no header row)
1,A0001
2,A0002
3,A0003
Expected output: output.csv (with a header row created)
ID,Financial Year,Total Income,Total Expenses,Tax Expense,Net Profit,OtherFieldsAsAndWhenFound
A001,01-Apr-2015 To 31-Mar-2016,675529.00,446577.00,71708.00,157621.00
A002,....
A003,....
Answer 0 (score: 0)
I suggest you look at pandas.read_html to parse your web data. On your sample data (with the HTML excerpt held in the string s), this gives you:
import pandas as pd
tables=pd.read_html(s, index_col=0)
tables[0]
Out[11]:
                                         1
0
Financial Year  01-Apr-2015 To 31-Mar-2016
tables[1]
                                                  1
0
Total income from operations (net) ( a + b)  675529
Total expenses                               446577
Tax expense                                   71708
Net Profit / (Loss)                          157621
You can then use pandas functions to do whatever data manipulation you need (adding the ID, etc.) and export the result with DataFrame.to_csv.
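To make that last step concrete, here is a minimal sketch of how the tables returned by pandas.read_html could be flattened into one row per ID and written out. The helper names (tables_to_row, page_to_row, rows_to_csv) and the RENAME mapping are hypothetical, made up for illustration. The sketch also assumes every table on the page has the two-column label/value layout shown in the excerpt, and it wraps the HTML in io.StringIO because recent pandas versions prefer a file-like object over a literal HTML string.

```python
import io

import pandas as pd

# Hypothetical mapping from the labels on the page to the column names
# the question asks for; labels not listed here keep their original text.
RENAME = {
    "Total income from operations (net) ( a + b)": "Total Income",
    "Total expenses": "Total Expenses",
    "Tax expense": "Tax Expense",
    "Net Profit / (Loss)": "Net Profit",
}


def tables_to_row(sid, tables):
    """Flatten label/value tables into one record for a single ID."""
    row = {"ID": sid}
    for table in tables:
        # With index_col=0 the labels form the index and the values sit
        # in the first remaining column.
        for label, value in table.iloc[:, 0].items():
            row[RENAME.get(label, label)] = value
    return row


def page_to_row(sid, html):
    """Parse one page's HTML text with pandas.read_html."""
    tables = pd.read_html(io.StringIO(html), index_col=0)
    return tables_to_row(sid, tables)


def rows_to_csv(rows, path_or_buf=None):
    """Write the records as CSV; returns the CSV text if no target is given."""
    return pd.DataFrame(rows).to_csv(path_or_buf, index=False)
```

In use, you would call page_to_row(sid, response.text) once per ID from SIDs.csv, collect the returned dicts in a list, and pass that list to rows_to_csv together with 'output.csv'. Because a DataFrame built from a list of dicts takes the union of all keys, a label that only shows up on some pages simply becomes an extra column, which seems close to the "OtherFieldsAsAndWhenFound" idea in the expected output.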