Python: parsing Beautiful Soup data to CSV

Time: 2015-10-11 23:10:43

Tags: python parsing python-3.x export-to-csv

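I am writing Python 3 code to parse an HTML/CSS table. There are a few problems:

  1. The header of my CSV output file is not generated by my code from the HTML (tag: td, class: t1) on the first run, when the output file is created.
  2. If the incoming HTML table has extra fields (tag: td, class: t1), my code currently cannot capture them and create additional headers in the CSV output file.
  3. Data is not written to the output CSV file until all IDs in the input file (A001, A002, A003 ...) have been processed. I want to write to the output CSV file as soon as each ID from the input file has been processed (i.e. write A001 to the CSV before processing A002).
  4. Whenever I rerun the code, the data does not start on the next line of the output CSV.
  5. As a newbie, I am sure my code is very crude and that there are better ways to do this; I would like to learn how to write it better and fix the issues above.

    I need advice and guidance, please help. Thanks.

    My code: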

    import csv
    import requests
    from bs4 import BeautifulSoup
    
    ## SIDs.csv contains ids in col2 based on which the 'url' variable pulls the respective data
    SIDFile = open('SIDs.csv')
    SIDReader = csv.reader(SIDFile)
    SID = list(SIDReader)
    
    SqID_data = []
    
    #create and open output file
    with open('output.csv','a', newline='') as csv_h:
        fields = \
        [
            "ID",
            "Financial Year",
            "Total Income",
            "Total Expenses",
            "Tax Expense",
            "Net Profit"
        ]
    
        for row in SID:
            col1,col2 = row
            SID ="%s" % (col2)
    
            url = requests.get("http://.......")
            soup = BeautifulSoup(url.text, "lxml")
    
            fy = soup.findAll('td',{'class':'tablehead'})
            titles = soup.findAll('td',{'class':'t1'})
            values = soup.findAll('td',{'class':'t0'})
    
            if titles:
                data = {}
                for title in titles:
                    name = title.find("td", class_ = "t1")
                data["ID"] = SID
                data["Financial Year"] = fy[0].string.strip()
                data["Total Income"] = values[0].string.strip()
                data["Total Expenses"] = values[1].string.strip()
                data["Tax Expense"] = values[2].string.strip()
                data["Net Profit"] = values[3].string.strip()
                SqID_data.append(data)
    
        #Prepare CSV writer.
        writer = csv.DictWriter\
        (
            csv_h,
            fields,
            quoting        = csv.QUOTE_ALL,
            extrasaction   = "ignore",
            dialect        = "excel",
            lineterminator = "\n",
        )
        writer.writeheader()
        writer.writerows(SqID_data)
        print("write rows complete")
    

    Excerpt of the HTML being processed:

    <p>
    <TABLE border=0 cellspacing=1 cellpadding=6 align=center class="vTable">
       <TR>
        <TD class=tablehead>Financial Year</TD>
        <TD class=t1>01-Apr-2015 To 31-Mar-2016</TD>
       </TR>
    </TABLE>
    </p>
    
    <p>
    <br>
    <table cellpadding=3 cellspacing=1 class=vTable>
    <TR>
        <TD class=t1><b>Total income from operations (net) ( a + b)</b></td>
        <TD class=t0 nowrap>675529.00</td>
    </tr>
    <TR>
        <TD class=t1><b>Total expenses</b></td>
        <TD class=t0 nowrap>446577.00</td>
    </tr>
    <TR>
        <TD class=t1>Tax expense</td>
        <TD class=t0 nowrap>71708.00</td>
    </tr>
    <TR>
        <TD class=t1><b>Net Profit / (Loss)</b></td>
        <TD class=t0 nowrap>157621</td>
    </tr>
    </table>
    </p>
    

    SIDs.csv (no header row)

    1,A0001
    2,A0002
    3,A0003
    

    Expected output: output.csv (with the header row created)

    ID,Financial Year,Total Income,Total Expenses,Tax Expense,Net Profit,OtherFieldsAsAndWhenFound
    A001,01-Apr-2015 To 31-Mar-2016,675529.00,446577.00,71708.00,157621.00
    A002,....
    A003,....
    

1 answer:

Answer 0 (score: 0)

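I suggest you take a look at pandas.read_html to parse your web data; on your sample data (assuming s holds the HTML excerpt as a string), this gives you: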

    import pandas as pd
    tables=pd.read_html(s, index_col=0)
    tables[0]
    Out[11]:
                                             1
    0
    Financial Year  01-Apr-2015 To 31-Mar-2016

    tables[1]
                                                      1
    0
    Total income from operations (net) ( a + b)  675529
    Total expenses                               446577
    Tax expense                                   71708
    Net Profit / (Loss)                          157621
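
The two tables above can be collapsed into one record per ID. The following is only a sketch under stated assumptions: HTML is a shortened copy of the excerpt from the question, tables_to_record is a hypothetical helper name, and pandas.read_html needs an HTML parser such as lxml installed. Because the labels are read from the page itself, extra rows in the table would show up as extra keys rather than being dropped.

```python
import io

import pandas as pd

# Shortened copy of the HTML excerpt from the question (assumption: the real
# page has the same label/value layout, one pair of cells per row).
HTML = """
<table class="vTable">
  <tr><td class="tablehead">Financial Year</td>
      <td class="t1">01-Apr-2015 To 31-Mar-2016</td></tr>
</table>
<table class="vTable">
  <tr><td class="t1"><b>Total expenses</b></td><td class="t0">446577.00</td></tr>
  <tr><td class="t1">Tax expense</td><td class="t0">71708.00</td></tr>
</table>
"""


def tables_to_record(sid, html):
    """Collapse every label/value table in `html` into one dict for `sid`.

    Hypothetical helper, not part of pandas.
    """
    # Wrapping the text in StringIO works on old and new pandas versions;
    # recent versions warn when given a raw HTML string.
    tables = pd.read_html(io.StringIO(html), index_col=0)
    record = {"ID": sid}
    for table in tables:
        # With index_col=0 the labels form the index and the values are the
        # single remaining column, so each table maps label -> value.
        record.update(table.iloc[:, 0].to_dict())
    return record


record = tables_to_record("A0001", HTML)
print(record)
```

Note that read_html converts numeric-looking cells to numbers, so the amounts come back as floats rather than the original strings.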

You can then use pandas functions to do whatever data manipulation you need (adding the id, etc.), and export the result with DataFrame.to_csv.
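
One possible way to do the export so that each ID is written as soon as it has been processed, the header is created on the first run, and previously unseen fields become new columns, is sketched below. The name append_record and its arguments are hypothetical; the sketch rewrites the whole file on every call, which is simple but only reasonable for small outputs.

```python
import os

import pandas as pd


def append_record(path, record):
    """Add one record (a dict) to the CSV at `path`.

    Hypothetical helper: the file is re-read and rewritten on each call so
    that the header can grow when a record brings new field names.
    """
    new_row = pd.DataFrame([record])
    if os.path.exists(path) and os.path.getsize(path) > 0:
        # dtype=str keeps the values already in the file as they were written.
        old = pd.read_csv(path, dtype=str)
        # concat aligns on column names and keeps the union of columns;
        # cells missing from either side are left empty in the CSV.
        combined = pd.concat([old, new_row], ignore_index=True)
    else:
        # First run: the header comes from the record's own keys.
        combined = new_row
    combined.to_csv(path, index=False)
    return combined
```

Calling append_record("output.csv", record) inside the loop over IDs, right after each page has been parsed, writes A001 before A002 is fetched, and a rerun continues below the rows that are already in the file.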