Question

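I am writing code in Python 3 to parse an HTML/CSS table. I have a few problems:

My CSV output file's header is not generated by my code from the HTML (tag: td, class: t1) on the first run, when the output file is created.
If the incoming HTML table has extra fields (tag: td, class: t1), my code currently fails to capture them and create additional headers in the CSV output file.
Data is not written to the output CSV file until all IDs in the input file (A001, A002, A003 ...) have been processed. I want to write to the output CSV file as soon as each ID from the input file has been processed (i.e. write A001 to the CSV before processing A002).
Whenever I rerun the code, the data does not start on the next line of the output CSV.

As a newbie, I am sure my code is very crude and that there are better ways to do this. I would like to learn to write it better and fix the issues above.

I need advice and guidance; please help. Thank you.

My code: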

import csv
import requests
from bs4 import BeautifulSoup

## SIDs.csv contains ids in col2 based on which the 'url' variable pulls the respective data
SIDFile = open('SIDs.csv')
SIDReader = csv.reader(SIDFile)
SID = list(SIDReader)

SqID_data = []

#create and open output file
with open('output.csv','a', newline='') as csv_h:
    fields = \
    [
        "ID",
        "Financial Year",
        "Total Income",
        "Total Expenses",
        "Tax Expense",
        "Net Profit"
    ]

    for row in SID:
        col1,col2 = row
        SID ="%s" % (col2)

        url = requests.get("http://.......")
        soup = BeautifulSoup(url.text, "lxml")

        fy = soup.findAll('td',{'class':'tablehead'})
        titles = soup.findAll('td',{'class':'t1'})
        values = soup.findAll('td',{'class':'t0'})

        if titles:
            data = {}
            for title in titles:
                name = title.find("td", class_ = "t1")
            data["ID"] = SID
            data["Financial Year"] = fy[0].string.strip()
            data["Total Income"] = values[0].string.strip()
            data["Total Expenses"] = values[1].string.strip()
            data["Tax Expense"] = values[2].string.strip()
            data["Net Profit"] = values[3].string.strip()
            SqID_data.append(data)

    #Prepare CSV writer.
    writer = csv.DictWriter\
    (
        csv_h,
        fields,
        quoting        = csv.QUOTE_ALL,
        extrasaction   = "ignore",
        dialect        = "excel",
        lineterminator = "\n",
    )
    writer.writeheader()
    writer.writerows(SqID_data)
    print("write rows complete")

Excerpt of the HTML being processed:

<p>
<TABLE border=0 cellspacing=1 cellpadding=6 align=center class="vTable">
   <TR>
    <TD class=tablehead>Financial Year</TD>
    <TD class=t1>01-Apr-2015 To 31-Mar-2016</TD>
   </TR>
</TABLE>
</p>

<p>
<br>
<table cellpadding=3 cellspacing=1 class=vTable>
<TR>
    <TD class=t1><b>Total income from operations (net) ( a + b)</b></td>
    <TD class=t0 nowrap>675529.00</td>
</tr>
<TR>
    <TD class=t1><b>Total expenses</b></td>
    <TD class=t0 nowrap>446577.00</td>
</tr>
<TR>
    <TD class=t1>Tax expense</td>
    <TD class=t0 nowrap>71708.00</td>
</tr>
<TR>
    <TD class=t1><b>Net Profit / (Loss)</b></td>
    <TD class=t0 nowrap>157621</td>
</tr>
</table>
</p>

SIDs.csv (no header row)

1,A0001
2,A0002
3,A0003

Expected output: output.csv (header row is created)

ID,Financial Year,Total Income,Total Expenses,Tax Expense,Net Profit,OtherFieldsAsAndWhenFound
A001,01-Apr-2015 To 31-Mar-2016,675529.00,446577.00,71708.00,157621.00
A002,....
A003,....

Answer 1

I suggest you look at pandas.read_html to parse your web data; on your sample data, this gives you:

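import pandas as pd
from io import StringIO

# s is assumed to hold the HTML excerpt from the question as a string;
# recent pandas versions expect literal HTML to be wrapped in StringIO
tables = pd.read_html(StringIO(s), index_col=0)
tables[0]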
Out[11]: 
                                         1
0                                         
Financial Year  01-Apr-2015 To 31-Mar-2016

tables[1]
                                                  1
0                                                  
Total income from operations (net) ( a + b)  675529
Total expenses                               446577
Tax expense                                   71708
Net Profit / (Loss)                          157621

You can then use pandas functions to do whatever data manipulation you need (adding the ID, etc.) and export with DataFrame.to_csv.
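
To connect this back to the four issues in the question, here is a minimal sketch, not a definitive implementation. It assumes every table has the label/value shape shown above (labels in the index, one value column) and that labels are unique across the tables of one page. The helper names record_from_tables and append_record are hypothetical, chosen only for illustration. The idea is to flatten the tables into one row per ID, then merge that row into the CSV immediately, so the header is created on the first run, grows when new labels appear, and reruns continue after the existing rows.

```python
import os

import pandas as pd


def record_from_tables(tables, sid):
    """Flatten label/value tables into a one-row DataFrame.

    `tables` is assumed to look like the result of
    pd.read_html(..., index_col=0): labels in the index, values in
    the first column. Column names come from the labels themselves,
    so extra fields on a page simply become extra columns.
    """
    record = {"ID": sid}
    for table in tables:
        # First data column as a {label: value} mapping.
        record.update(table.iloc[:, 0].to_dict())
    return pd.DataFrame([record])


def append_record(path, row):
    """Merge one row into the CSV at `path` right away.

    The file is re-read and rewritten on every call. That is slower
    than a plain append, but it lets the header widen when a page
    introduces a label that earlier pages did not have.
    """
    if os.path.exists(path) and os.path.getsize(path) > 0:
        existing = pd.read_csv(path, dtype=str)
        # Existing columns keep their order; new labels go to the end.
        row = pd.concat([existing, row.astype(str)],
                        ignore_index=True, sort=False)
    row.to_csv(path, index=False)
```

Inside the loop over IDs, this would be called once per page, for example append_record("output.csv", record_from_tables(tables, sid)) with tables taken from pandas.read_html for that ID. Rows written before a new column appeared are left empty in that column. If the output grows large, rewriting the whole file for each ID may become too slow, and collecting rows in batches would be a reasonable alternative.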
