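I have a program that extracts data from a set of HTML documents and writes it to a CSV file. I use a list of work order numbers to tell the application which directories to open, and I want to insert the current work order number into every row that belongs to it. I can't figure out how to insert the woNumber value into the nested list that I use to build the CSV file. Here is the complete application: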
import os, bs4, csv
from bs4 import BeautifulSoup
csvOut ='C:\\GLO\\Scripts\\engineering_out'
woFile = 'C:\\glo\\wos.txt'
rows = []
f = open(woFile, 'wb')
#dirList = os.listdir('E:\\Elements\\Disaster Recovery Program\\Engineering Home')
dirList = os.listdir('C:\\GLO\\Test')
for line in dirList:
    f.write(str(line) + '\n')
f.close()
f = open(woFile, 'rb')
for i in f:
    woNumber = i.rstrip('\n')
    #File paths structure
    woMetaPath = 'E:\\Elements\\Disaster Recovery Program\\Engineering Home\\'+ woNumber + '\\DocumentLibrary-Correspondence\\Correspondence.xls'
    if os.path.exists(woMetaPath):
        #Open the metadata file, find the table, and extract the data
        html = open(woMetaPath)
        soup = BeautifulSoup(html)
        table = soup.find("table")
        for row in table.find_all('tr'):
            rows.append([val.text.encode('utf-8') for val in row.find_all('td')])
    print 'Done with ' + woNumber
#print 'Done with reading meta data files!'
with open((csvOut + '.csv'), 'wb') as f:
    writer = csv.writer(f, delimiter='*')
    writer.writerows(row for row in rows if row)
print 'Done writing CSV file!'
Here is an example of the structure of the data files I am extracting from:
<table id="_tblListView" border="0" style="border-color:Silver;border-width:1px;border-style:Solid;">
<tr style="background-color:Gainsboro;font-weight:bold;">
<td>Bid Package ID</td><td>Date Rec'vd</td><td>Description</td><td>File Type</td><td>Name</td><td>Title</td>
</tr><tr>
<td></td><td></td><td></td><td>pdf</td><td>WO 10101-1 Exhibit A, B.pdf</td><td>WO 10101-1 Exhibit A, B</td>
</tr><tr>
<td>10101-1_BID1</td><td></td><td></td><td>xlsx</td><td>10101-1_BID1_PC Submittal_Review Checklist.xlsx</td><td></td>
</tr><tr>
<td>10101-1_BID1</td><td></td><td></td><td>xlsx</td><td>10101-1_BID1_General_Review Checklist.xlsx</td><td></td>
</tr><tr>
<td>10101-1_BID1</td><td></td><td>10101-1 BID1 Generator Review Checklist</td><td>xlsx</td><td>10101-1_BID1_Generator_Review_Checklist.xlsx</td><td>10101-1 BID1 Generator Review Checklist</td>
</tr><tr>
<td>10101-1_BID1</td><td></td><td></td><td>xlsx</td><td>10101-1-1_BID1_Bid Documents_Review.xlsx</td><td></td>
</tr><tr>
<td>10101-1_BID1</td><td>4/29/2010 12:00:00 AM</td><td>Brazoria County_City of Angleton_Lift Stations 1, 4, & 7</td><td>xlsx</td><td>10101-1_bid1_ Equip Foundation_Review Checklist.xlsx</td><td>Equipment Foundation Review</td>
</tr><tr>
<td></td><td></td><td></td><td>msg</td><td>FW WO_10101-1_BID1_60_1.msg</td><td></td>
</tr><tr>
<td></td><td></td><td></td><td>msg</td><td>RE WO_10101-1_BID1_Final_1.msg</td><td></td>
</tr><tr>
<td></td><td></td><td></td><td>msg</td><td>WO_10101-1_BID1_Final_1_Accepted.msg</td><td></td>
</tr><tr>
<td>10101-1_BID2</td><td>6/17/2010 12:00:00 AM</td><td></td><td>xlsx</td><td>10101-1_BID2_Bid_Documents_Review Checklist.xlsx</td><td></td>
</tr><tr>
<td>10101-1_BID2</td><td>6/17/2010 12:00:00 AM</td><td>10101-1_BID2_General_Review_Checklist
</td><td>xlsx</td><td>10101-1_BID2_General_Review_Checklist.xlsx</td><td>10101-1_BID2_General_Review_Checklist</td>
</tr><tr>
<td>10101-1_BID2</td><td>6/17/2010 12:00:00 AM</td><td>10101-1_BID2_Generator_Review_Checklist</td><td>xlsx</td><td>10101-1_BID2_Generator_Review_Checklist.xlsx</td><td>10101-1_BID2_Generator_Review_Checklist</td>
</tr><tr>
<td>10101-1_BID2</td><td></td><td></td><td>xlsx</td><td>10101-1_BID2_PC Submittal_Review Checklist.xlsx</td><td></td>
</tr><tr>
<td>10101-1_BID1</td><td></td><td></td><td>msg</td><td>WO 10101-1_BID1_Final_1 Rejected.msg</td><td></td>
</tr><tr>
<td>10101-1_BID1</td><td></td><td></td><td>msg</td><td>WO 10101-1_BID1_Final_2.msg</td><td></td>
</tr><tr>
<td>10101-1_BID2</td><td></td><td></td><td>msg</td><td>WO 10101-1_BID2_60_2.msg</td><td></td>
</tr><tr>
<td>10101-1_BID2</td><td></td><td></td><td>msg</td><td>WO 10101-1_BID2_Final_1 Brazoria Co (Iowa Colony).msg</td><td></td>
</tr><tr>
<td>10101-1_BID3</td><td>10/4/2010 12:00:00 AM</td><td>10101-1_BID3_Generator_Review_Checklist</td><td>xlsx</td><td>10101-1_BID3_Generator_Review_Checklist.xlsx</td><td>10101-1_BID3_Generator_Review_Checklist</td>
</tr><tr>
<td>10101-1_BID3</td><td>10/4/2010 12:00:00 AM</td><td>10101-1_BID3_General_Review_Checklist</td><td>xlsx</td><td>10101-1_BID3_General_Review_Checklist.xlsx</td><td>10101-1_BID3_General_Review_Checklist</td>
</tr><tr>
<td>10101-1_BID3</td><td></td><td></td><td>xlsx</td><td>10101-1_BID3_Building (Structural)_Review Checklists.xlsx</td><td>Commodore Cove Waste Wter and Water Plant</td>
</tr><tr>
<td>10101-1_BID3</td><td></td><td></td><td>xlsx</td><td>10101-1_BID3_PC Submittal_Review Checklist.xlsx</td><td></td>
</tr><tr>
<td>10101-1_BID3</td><td>10/5/2010 12:00:00 AM</td><td></td><td>xlsx</td><td>10101-1_BID3_Bid Documents_Review.xlsx</td><td></td>
</tr><tr>
<td>10101-1_BID3</td><td></td><td></td><td>msg</td><td>WO 10101-1_BID3_60_1 Brazoria Co.msg</td><td></td>
</tr><tr>
<td>10101-1_BID3</td><td></td><td></td><td>msg</td><td>WO 10101-1_BID3_Final_1 Brazoria Co.msg</td><td></td>
</tr><tr>
<td>10101-1_BID3</td><td></td><td></td><td>msg</td><td>RE WO 10101-1_BID3_Final_1 Brazoria Co.msg</td><td></td>
</tr><tr>
<td>10101-1_BID3</td><td></td><td></td><td>msg</td><td>RE WO 10101-1_BID3 Brazoria Co Scope Variance.msg</td><td></td>
</tr><tr>
<td>10101-1_BID5</td><td></td><td></td><td>xlsx</td><td>10101-1_BID5_General_Review Checklist.xlsx</td><td></td>
</tr><tr>
<td>10101-1_BID5</td><td></td><td></td><td>xlsx</td><td>10101-1_BID5_Building (Structural)_Review Checklist.xlsx</td><td></td>
</tr><tr>
<td>10101-1_BID5</td><td></td><td></td><td>xlsx</td><td>10101-1_BID5_PC_Submittal_Review_Checklist.xlsx</td><td></td>
</tr><tr>
<td>10101-1_BID5</td><td></td><td></td><td>msg</td><td>WO 10101-1_BID5_30_1 Brazoria County.msg</td><td></td>
</tr><tr>
<td>10101-1_BID5</td><td></td><td>Brazoria County, Commodore Cove Elevated electrical and chlorine buildings at water plant and WWTP.</td><td>xlsx</td><td>10101-1_BID5 Checklist for Bid Docs.xlsx</td><td>Checklist for Bid Documents</td>
</tr><tr>
<td>10101-1_BID5</td><td></td><td></td><td>msg</td><td>WO 10101-1_BID5_Final_1 Brazoria County.msg</td><td></td>
</tr>
</table>
For example, my current code writes the third row as:
10101-1_BID1***xlsx*10101-1_BID1_PC Submittal_Review Checklist.xlsx*
I would like it to be:
10101-1*10101-1_BID1***xlsx*10101-1_BID1_PC Submittal_Review Checklist.xlsx*
Answer 0: (score: 0)
Since you have the woNumber for each file you are processing, you can change:
for row in table.find_all('tr'):
    rows.append([val.text.encode('utf-8') for val in row.find_all('td')])
to:
for row in table.find_all('tr'):
    rows.append([woNumber] + [val.text.encode('utf-8') for val in row.find_all('td')])
so that every row in rows contains the woNumber of the file being processed at that point.
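The prepend step can be tried in isolation. The following is only a minimal sketch: the helper name `tag_rows` and the hard-coded `parsed` lists are hypothetical stand-ins for the cell text that `row.find_all('td')` would produce, and it uses plain strings rather than the encoded bytes in the question's code.

```python
def tag_rows(wo_number, table_rows):
    """Return a new list in which every row starts with wo_number."""
    return [[wo_number] + list(cells) for cells in table_rows]


# Hypothetical cell text for two <tr> elements
parsed = [
    ['10101-1_BID1', '', '', 'xlsx',
     '10101-1_BID1_PC Submittal_Review Checklist.xlsx', ''],
    [],  # a <tr> that has no <td> cells
]

tagged = tag_rows('10101-1', parsed)
```

Note that the row without cells still ends up as a one-element list holding only the work order number.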
Note: this changes how the following line behaves:
writer.writerows(row for row in rows if row)
Every row now contains its woNumber even when it has no other data, so the `if row` test is always true. You can therefore change it to:
writer.writerows(row for row in rows if len(row) != 1)
so that only rows containing data beyond the woNumber are written.
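An alternative to the length check, not part of the change above, is to skip empty rows before the work order number is added, so the original `if row` filter is no longer needed. The sketch below assumes Python 3 (the question's code is Python 2), uses hard-coded cell lists in place of the BeautifulSoup output, and writes to an in-memory buffer instead of a file; `wo_number`, `parsed`, and `buffer` are hypothetical names.

```python
import csv
import io

wo_number = '10101-1'  # hypothetical work order number

# Stand-in for the <td> text of three <tr> elements
parsed = [
    ['10101-1_BID1', '', '', 'xlsx',
     '10101-1_BID1_PC Submittal_Review Checklist.xlsx', ''],
    [],  # a <tr> without any <td> cells
    ['', '', '', 'msg', 'FW WO_10101-1_BID1_60_1.msg', ''],
]

rows = []
for cells in parsed:
    if cells:  # drop empty rows before prepending the work order number
        rows.append([wo_number] + cells)

# Write with the same '*' delimiter as the question, into memory
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter='*')
writer.writerows(rows)
lines = buffer.getvalue().splitlines()
```

With this data, `lines[0]` is `10101-1*10101-1_BID1***xlsx*10101-1_BID1_PC Submittal_Review Checklist.xlsx*`, which matches the output requested in the question.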