Inserting a single value into a nested list in Python with BeautifulSoup

Time: 2013-08-05 21:09:49

Tags: python list beautifulsoup

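I have a program that extracts data from a bunch of HTML documents and then writes that data to a CSV file. I use a list of work order numbers to tell the application which directories to open, and I want to insert the current work order number into every row it corresponds to. I can't figure out how to insert that woNumber value into the nested list I use to create my CSV file. Here is the full application: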

import os, bs4, csv
from bs4 import BeautifulSoup

csvOut ='C:\\GLO\\Scripts\\engineering_out'
woFile = 'C:\\glo\\wos.txt'
rows = []

f = open(woFile, 'wb')
#dirList = os.listdir('E:\\Elements\\Disaster Recovery Program\\Engineering Home')
dirList = os.listdir('C:\\GLO\\Test')

for line in dirList:
    f.write(str(line) + '\n')
f.close()

f = open(woFile, 'rb')

for i in f:

    woNumber = i.rstrip('\n')

    #File paths structure
    woMetaPath = 'E:\\Elements\\Disaster Recovery Program\\Engineering Home\\'+ woNumber + '\\DocumentLibrary-Correspondence\\Correspondence.xls'

    if os.path.exists(woMetaPath):
        #Open the metadata file, find the table, and extract the data
        html = open(woMetaPath)

        soup = BeautifulSoup(html)
        table = soup.find("table")
        for row in table.find_all('tr'):
            rows.append([val.text.encode('utf-8') for val in row.find_all('td')])    
        print 'Done with ' + woNumber
    #print 'Done with reading meta data files!'

with open((csvOut + '.csv'), 'wb') as f:
    writer = csv.writer(f, delimiter='*')
    writer.writerows(row for row in rows if row)

print 'Done writing CSV file!'

Here is an example of the structure of the data files I am extracting from:

<table id="_tblListView" border="0" style="border-color:Silver;border-width:1px;border-style:Solid;">
    <tr style="background-color:Gainsboro;font-weight:bold;">
        <td>Bid Package ID</td><td>Date Rec'vd</td><td>Description</td><td>File Type</td><td>Name</td><td>Title</td>
    </tr><tr>
        <td></td><td></td><td></td><td>pdf</td><td>WO 10101-1 Exhibit A, B.pdf</td><td>WO 10101-1 Exhibit A, B</td>
    </tr><tr>
        <td>10101-1_BID1</td><td></td><td></td><td>xlsx</td><td>10101-1_BID1_PC Submittal_Review Checklist.xlsx</td><td></td>
    </tr><tr>
        <td>10101-1_BID1</td><td></td><td></td><td>xlsx</td><td>10101-1_BID1_General_Review Checklist.xlsx</td><td></td>
    </tr><tr>
        <td>10101-1_BID1</td><td></td><td>10101-1 BID1 Generator Review Checklist</td><td>xlsx</td><td>10101-1_BID1_Generator_Review_Checklist.xlsx</td><td>10101-1 BID1 Generator Review Checklist</td>
    </tr><tr>
        <td>10101-1_BID1</td><td></td><td></td><td>xlsx</td><td>10101-1-1_BID1_Bid Documents_Review.xlsx</td><td></td>
    </tr><tr>
        <td>10101-1_BID1</td><td>4/29/2010 12:00:00 AM</td><td>Brazoria County_City of Angleton_Lift Stations 1, 4, & 7</td><td>xlsx</td><td>10101-1_bid1_ Equip Foundation_Review Checklist.xlsx</td><td>Equipment Foundation Review</td>
    </tr><tr>
        <td></td><td></td><td></td><td>msg</td><td>FW  WO_10101-1_BID1_60_1.msg</td><td></td>
    </tr><tr>
        <td></td><td></td><td></td><td>msg</td><td>RE  WO_10101-1_BID1_Final_1.msg</td><td></td>
    </tr><tr>
        <td></td><td></td><td></td><td>msg</td><td>WO_10101-1_BID1_Final_1_Accepted.msg</td><td></td>
    </tr><tr>
        <td>10101-1_BID2</td><td>6/17/2010 12:00:00 AM</td><td></td><td>xlsx</td><td>10101-1_BID2_Bid_Documents_Review Checklist.xlsx</td><td></td>
    </tr><tr>
        <td>10101-1_BID2</td><td>6/17/2010 12:00:00 AM</td><td>10101-1_BID2_General_Review_Checklist
</td><td>xlsx</td><td>10101-1_BID2_General_Review_Checklist.xlsx</td><td>10101-1_BID2_General_Review_Checklist</td>
    </tr><tr>
        <td>10101-1_BID2</td><td>6/17/2010 12:00:00 AM</td><td>10101-1_BID2_Generator_Review_Checklist</td><td>xlsx</td><td>10101-1_BID2_Generator_Review_Checklist.xlsx</td><td>10101-1_BID2_Generator_Review_Checklist</td>
    </tr><tr>
        <td>10101-1_BID2</td><td></td><td></td><td>xlsx</td><td>10101-1_BID2_PC Submittal_Review Checklist.xlsx</td><td></td>
    </tr><tr>
        <td>10101-1_BID1</td><td></td><td></td><td>msg</td><td>WO 10101-1_BID1_Final_1 Rejected.msg</td><td></td>
    </tr><tr>
        <td>10101-1_BID1</td><td></td><td></td><td>msg</td><td>WO 10101-1_BID1_Final_2.msg</td><td></td>
    </tr><tr>
        <td>10101-1_BID2</td><td></td><td></td><td>msg</td><td>WO 10101-1_BID2_60_2.msg</td><td></td>
    </tr><tr>
        <td>10101-1_BID2</td><td></td><td></td><td>msg</td><td>WO 10101-1_BID2_Final_1 Brazoria Co (Iowa Colony).msg</td><td></td>
    </tr><tr>
        <td>10101-1_BID3</td><td>10/4/2010 12:00:00 AM</td><td>10101-1_BID3_Generator_Review_Checklist</td><td>xlsx</td><td>10101-1_BID3_Generator_Review_Checklist.xlsx</td><td>10101-1_BID3_Generator_Review_Checklist</td>
    </tr><tr>
        <td>10101-1_BID3</td><td>10/4/2010 12:00:00 AM</td><td>10101-1_BID3_General_Review_Checklist</td><td>xlsx</td><td>10101-1_BID3_General_Review_Checklist.xlsx</td><td>10101-1_BID3_General_Review_Checklist</td>
    </tr><tr>
        <td>10101-1_BID3</td><td></td><td></td><td>xlsx</td><td>10101-1_BID3_Building (Structural)_Review Checklists.xlsx</td><td>Commodore Cove Waste Wter and Water Plant</td>
    </tr><tr>
        <td>10101-1_BID3</td><td></td><td></td><td>xlsx</td><td>10101-1_BID3_PC Submittal_Review Checklist.xlsx</td><td></td>
    </tr><tr>
        <td>10101-1_BID3</td><td>10/5/2010 12:00:00 AM</td><td></td><td>xlsx</td><td>10101-1_BID3_Bid Documents_Review.xlsx</td><td></td>
    </tr><tr>
        <td>10101-1_BID3</td><td></td><td></td><td>msg</td><td>WO 10101-1_BID3_60_1 Brazoria Co.msg</td><td></td>
    </tr><tr>
        <td>10101-1_BID3</td><td></td><td></td><td>msg</td><td>WO 10101-1_BID3_Final_1 Brazoria Co.msg</td><td></td>
    </tr><tr>
        <td>10101-1_BID3</td><td></td><td></td><td>msg</td><td>RE  WO 10101-1_BID3_Final_1 Brazoria Co.msg</td><td></td>
    </tr><tr>
        <td>10101-1_BID3</td><td></td><td></td><td>msg</td><td>RE  WO 10101-1_BID3 Brazoria Co Scope Variance.msg</td><td></td>
    </tr><tr>
        <td>10101-1_BID5</td><td></td><td></td><td>xlsx</td><td>10101-1_BID5_General_Review Checklist.xlsx</td><td></td>
    </tr><tr>
        <td>10101-1_BID5</td><td></td><td></td><td>xlsx</td><td>10101-1_BID5_Building (Structural)_Review Checklist.xlsx</td><td></td>
    </tr><tr>
        <td>10101-1_BID5</td><td></td><td></td><td>xlsx</td><td>10101-1_BID5_PC_Submittal_Review_Checklist.xlsx</td><td></td>
    </tr><tr>
        <td>10101-1_BID5</td><td></td><td></td><td>msg</td><td>WO 10101-1_BID5_30_1 Brazoria County.msg</td><td></td>
    </tr><tr>
        <td>10101-1_BID5</td><td></td><td>Brazoria County, Commodore Cove Elevated electrical and chlorine buildings at water plant and WWTP.</td><td>xlsx</td><td>10101-1_BID5 Checklist for Bid Docs.xlsx</td><td>Checklist for Bid Documents</td>
    </tr><tr>
        <td>10101-1_BID5</td><td></td><td></td><td>msg</td><td>WO 10101-1_BID5_Final_1 Brazoria County.msg</td><td></td>
    </tr>
</table>

For example, the code I have writes the third row as:

10101-1_BID1***xlsx*10101-1_BID1_PC Submittal_Review Checklist.xlsx*

I would like it to be:

10101-1*10101-1_BID1***xlsx*10101-1_BID1_PC Submittal_Review Checklist.xlsx*

1 Answer:

Answer 0 (score: 0)

Since you have the woNumber for every file you process, you can change:

for row in table.find_all('tr'):
    rows.append([val.text.encode('utf-8') for val in row.find_all('td')])

to:

for row in table.find_all('tr'):
    rows.append([woNumber] + [val.text.encode('utf-8') for val in row.find_all('td')])

That way every row in rows carries the woNumber of the file you were processing at the time.
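The prepend step can be tried on its own, without any HTML files. The following is a minimal sketch in Python 3 syntax (the question's code is Python 2), and it makes some assumptions: prepend_work_order and sample_rows are hypothetical names, and the plain lists stand in for the cell texts that row.find_all('td') would yield.

```python
def prepend_work_order(wo_number, table_rows):
    """Return new rows with wo_number as the first column of each row."""
    return [[wo_number] + list(cells) for cells in table_rows]


# Hypothetical stand-in for the cell texts extracted from one table.
sample_rows = [
    ["10101-1_BID1", "", "", "xlsx",
     "10101-1_BID1_PC Submittal_Review Checklist.xlsx", ""],
    [],  # a <tr> that contains no <td> cells
]

rows = []
rows.extend(prepend_work_order("10101-1", sample_rows))

for row in rows:
    print(row)
```

Note that the row without any cells is no longer empty after this step: it now holds the work order number alone.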

Note: this changes how the following line behaves:

writer.writerows(row for row in rows if row)

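Every row now contains its woNumber even when it has no other information, so the "if row" test no longer filters anything out. You can therefore change it to: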

writer.writerows(row for row in rows if len(row) != 1)

so that only rows containing information beyond the woNumber are written.
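To check the filter together with the '*' delimiter, here is a small sketch in Python 3 syntax. It writes to an in-memory buffer instead of the question's output file, and write_rows is a hypothetical helper name; the sample row is the third table row from the question with the work order number prepended.

```python
import csv
import io

rows = [
    ["10101-1", "10101-1_BID1", "", "", "xlsx",
     "10101-1_BID1_PC Submittal_Review Checklist.xlsx", ""],
    ["10101-1"],  # only the work order number: should be skipped
]


def write_rows(rows, out):
    """Write rows that hold more than just the work order number."""
    writer = csv.writer(out, delimiter="*", lineterminator="\n")
    writer.writerows(row for row in rows if len(row) > 1)


buf = io.StringIO()
write_rows(rows, buf)
lines = buf.getvalue().splitlines()
print(lines)
# ['10101-1*10101-1_BID1***xlsx*10101-1_BID1_PC Submittal_Review Checklist.xlsx*']
```

When writing to a real file in Python 3, open it in text mode with newline='' (for example open(path, 'w', newline='')) rather than the 'wb' mode used in the Python 2 code above.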