Extracting HTML data to an Excel spreadsheet

Time: 2017-02-15 19:02:07

Tags: python html excel

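I am trying to use Python to "read" an HTML document and write the output to an Excel spreadsheet. The HTML file is a table of CUs (cost units, identified by being written in all capital letters) and their descriptions. I want to list the CUs in one column and put the corresponding descriptions in another column. I have a global that stores part of the text until a CU is reached and then puts the text in the correct column, but for some reason the code does not finish the list of all CUs, and it does not put the descriptions in the right place (it places them in a column, shifted down from the applicable CU). Can anyone help me figure out what I am doing wrong? Here is my code so far: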

from HTMLParser import HTMLParser
import xlwt
global wb
global ws
global cucounter
global textcounter
global tempcu
textstore = ""
cucounter = 0
textcounter = 0
wb = xlwt.Workbook()
ws = wb.add_sheet('A Test Sheet')
filename = 'C:\\Python27\\ArcGIS10.3\\Doc\\Page.html'
f = open(filename, "r").read()

class MyHTMLParser(HTMLParser):

    def handle_data(self, data):
        if data.isupper():
             try:
                  global cucounter
                  ws.write(cucounter, 1, data)
                  cucounter = cucounter + 1
                  wb.save('ElecTest.xls')
             except UnicodeDecodeError:
                  pass
        if data.isspace():
              pass
        else:
            try:
             global textstore
             textstore += str(data)
             if data.isupper():
                  global textstore
                  global textcounter
                  ws.write(textcounter, 2, textstore)
                  textcounter = textcounter + 1
                  textstore = ""
                  wb.save('ElectTest.xls')
            except UnicodeDecodeError:
                  pass



parser = MyHTMLParser()
parser.feed(f)
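
Reading the code above, the accumulated text is only written when the next all-caps chunk arrives, and `textstore` also absorbs the CU itself, so each description lands one row below its CU and the last description is never written. The two `wb.save` calls also target different file names (`ElecTest.xls` and `ElectTest.xls`). What follows is a minimal sketch of one way to keep each CU and its description on the same row, not a tested fix for the real page. The `CU_PATTERN` regex, the names `CUParser`, `parse_cus` and `write_rows`, and the sheet name are illustrative assumptions, and the sketch assumes a CU never shows up as a standalone text chunk inside a description. The import fallback is there so it can run on Python 2.7 or Python 3.

```python
import re

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2, as in the question

# Assumption: a CU is a single token of capitals, digits and hyphens,
# at least four characters long (e.g. EULBCOMP85STD, EUCDIN-OUTJACK).
CU_PATTERN = re.compile(r"^[A-Z][A-Z0-9-]{3,}$")


class CUParser(HTMLParser):
    """Collect [cu, description] pairs from the text chunks of a page."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.rows = []

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if not text:
            return
        if CU_PATTERN.match(text):
            # A new CU starts a new row with an empty description.
            self.rows.append([text, ""])
        elif self.rows:
            # Any other text belongs to the most recent CU.
            current = self.rows[-1]
            current[1] = (current[1] + " " + text).strip()


def parse_cus(html_text):
    """Return a list of (cu, description) tuples found in html_text."""
    parser = CUParser()
    parser.feed(html_text)
    parser.close()
    return [tuple(row) for row in parser.rows]


def write_rows(rows, path):
    """Write the pairs to an .xls file, one pair per row, saving once."""
    import xlwt  # imported here so parsing works without xlwt installed
    wb = xlwt.Workbook()
    ws = wb.add_sheet("CUs")  # hypothetical sheet name
    for index, (cu, description) in enumerate(rows):
        ws.write(index, 0, cu)
        ws.write(index, 1, description)
    wb.save(path)
```

Usage might look like `write_rows(parse_cus(f), 'ElecTest.xls')`, where `f` is the HTML text read earlier. Collecting the pairs first and saving once at the end also avoids rewriting the workbook on every chunk.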

Unfortunately, I cannot add my HTML file in the proper format (if I could, the UnicodeDecodeError handling would make more sense), but here is what I was able to copy:

Page C/U Description: M-M

EULBPIT     Excavate, backfill & tamp auger pit or primary splice hole.  Qty "1" per occurrence.  4'X4'X5' pit. 
EULBCOMPWHEEL   Wheel Compaction - Tamping with wheel, where initial lift is rolled, trench filled & crowned and rolled again and where additional traffic is expected in location assists with tamping. 
EULBCOMP85STD   85% Std. Proctor Compaction - Trench where subsidence is unsettled and probable due to nature of area, needing compaction equipment w/ 12” lifts, use in parking lots, adjacent to roadways & front lot line URD. 
EULBCOMP85MOD   85% Modified Proctor Compaction - Trenches under hard surfaces of roadway, more rigid than std, requiring compaction equipment w/ maximum 12” lifts, minimum12” lift from cable, soil and moisture content critical, hand test required also. 
EULBCOMP95STD   95% Std. Proctor Compaction - Used by most local jurisdictions, close to, but more than, 85% but needing more moisture, 12” lifts should be used and hand test for adequate moisture. 
EULBCOMP95MOD   95% Modified Proctor Compaction - Trenches under hard surfaces of roadway, more rigid than std, requiring compaction equipment w/ maximum 12” lifts, minimum12” lift from cable, soil and moisture content critical, hand test required. 
EULBCOMP    Compaction Test 
EULBSHORE   Shoring, 5’ high, 2-sided per ft per day 
EULBTHAWU   Thaw master/UG work:  Specify "1" in install column only.  Includes install, remove, lighting, & setting (2) burners with propane tank. 
EULBJACKHAMMER  Jackhammer:  Specify per sq ft X 4" deep.  Install column only. 
EULBHANDIKRETE  Handikrete.  Install only-Specify "1" per cu ft (1-bag). 
EUCDJACK4STPIPE     Jack 4" galvanized steel pipe - includes pipe & coupling.  Set up and dismantle jacking equipment.    Specify "1" per ft. 
EUCDJACK5STPIPE     Jack 5" galvanized steel pipe - includes pipe & coupling.  Set up and dismantle jacking equipment.  Specify "1" per ft. 
EUCDJACK6STPIPE     Jack 6" galvanized steel pipe - includes pipe & coupling.  Set up and dismantle jacking equipment.  Specify "1" per ft. 
EUCDIN-OUTJACK  Setting up & dismantle jacking equipment.  Includes digging & filling of pits.  Specify "1" per occurrence in the install column only. 
EUCDCASE24  Jack 24" casing--Specify "1" per ft - does not include pipe. 
EULBYSNOW   Snow removal.  Install column only; specify “1” for every 2 man-hours. 
EULBCLEANADJUST     Clean or adjust switchgear. Install only; specify “1” per occurrence. 
EULBUGLC    Install or remove line covers. Specify # of covers and occurrences. 
EULBLOWRCBL     Lowering cable - specify per linear ft.  Install column only. 
EULBGROUNDCBL   Install, remove or test for ground on cable.  Specify “1” per occurrence. 
EULBMOVECBL     Place terminator on stand-off or energized bushing.  Specify “1” per occurrence. Install column only. 
EULBPHASE-U     Phase-in UG conductor.  Install only; specify “1” per occurrence. 
EULBTRANSRISER  Transfer riser cable. Specify # of cables; install only:  specify “1” per occurrence. 
EULBLOCATEFAULT     Find UG cable fault - Install column only; specify “1” per occurrence. 
EULBCBLIDTESTER     Identify cable with impulse phaser.  Specify “1” per occurrence 
EULBPIERCECBL   Ground pierce cable - Install column only.  Specify “1” per occurrence. 
EULBSWITCH  Switch URD 600 A PMH gear.  Specify “1” per occurrence. 
EULBSWOIL   Switch-open & close OCR & Leads. Specify “1” per occurrence. 
EULBPDLCK   Padlock open and close.  Specify “1” per occurrence. 
EULBCOVERHOLE   Plywood to cover construction hole.  Specify “1’ per occurrence. 
EULBSCRTYFENCNG     Remove/replace/install security fencing (orange) around splice pit.  Specify “1” per occurrence. 
EULBDRTPKUP     Dirt pick-up: Load & haul excess dirt on site, per cu yd. 
EULBDRTPKPD     Dirt pick-up: Load & haul excess dirt off site, per cu yd. 
EULBROADBASE    Road base, labor only to install; specify "1" per cu yd. 
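
Several of the rows above contain curly quotes (for example `12”` and `“1”`), which are non-ASCII characters, and the pasted HTML header below declares `charset=windows-1252`. This may be related to the `UnicodeDecodeError` the code guards against, although that is a guess without the full file. A minimal sketch, assuming the file really is encoded as Windows-1252, is to decode it once while reading so that everything downstream receives Unicode text. The helper name `read_html` is hypothetical.

```python
import io


def read_html(path, encoding="cp1252"):
    """Read an HTML file and return its contents as Unicode text.

    Assumption: the encoding matches the charset declared in the
    page's <meta> tag (windows-1252 in the pasted header).
    """
    with io.open(path, "r", encoding=encoding) as handle:
        return handle.read()
```

With this, `f = read_html(filename)` would replace the plain `open(filename, "r").read()` call, and the decoded text can be passed to the parser and to `ws.write` without a byte-to-text conversion happening implicitly.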

<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
xmlns="http://www.w3.org/TR/REC-html40">

<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 15">
<meta name=Originator content="Microsoft Word 15">
<link rel=File-List href="Page_files/filelist.xml">
<!--[if gte mso 9]><xml>
 <o:DocumentProperties>
  <o:Author>John Swordy</o:Author>
  <o:LastAuthor>John Swordy</o:LastAuthor>
  <o:Revision>1</o:Revision>
  <o:TotalTime>1</o:TotalTime>
  <o:Created>2017-02-15T16:44:00Z</o:Created>
  <o:LastSaved>2017-02-15T16:45:00Z</o:LastSaved>
  <o:Pages>2</o:Pages>
  <o:Words>600</o:Words>
  <o:Characters>3426</o:Characters>
  <o:Company>En Engineering</o:Company>
  <o:Lines>28</o:Lines>
  <o:Paragraphs>8</o:Paragraphs>
  <o:CharactersWithSpaces>4018</o:CharactersWithSpaces>
  <o:Version>16.00</o:Version>
 </o:DocumentProperties>
 <o:OfficeDocumentSettings>
  <o:AllowPNG/>
 </o:OfficeDocumentSettings>
</xml><![endif]-->
<link rel=themeData href="Page_files/themedata.thmx">
<link rel=colorSchemeMapping href="Page_files/colorschememapping.xml">
<!--[if gte mso 9]><xml>
 <w:WordDocument>
  <w:TrackMoves>false</w:TrackMoves>
  <w:TrackFormatting/>
  <w:PunctuationKerning/>
  <w:ValidateAgainstSchemas/>
  <w:SaveIfXMLInvalid>false</w:SaveIfXMLInvalid>
  <w:IgnoreMixedContent>false</w:IgnoreMixedContent>
  <w:AlwaysShowPlaceholderText>false</w:AlwaysShowPlaceholderText>
  <w:DoNotPromoteQF/>
  <w:LidThemeOther>EN-US</w:LidThemeOther>
  <w:LidThemeAsian>X-NONE</w:LidThemeAsian>
  <w:LidThemeComplexScript>X-NONE</w:LidThemeComplexScript>
  <w:Compatibility>
   <w:BreakWrappedTables/>
   <w:SnapToGridInCell/>
   <w:WrapTextWithPunct/>
   <w:UseAsianBreakRules/>
   <w:DontGrowAutofit/>
   <w:SplitPgBreakAndParaMark/>
   <w:EnableOpenTypeKerning/>
   <w:DontFlipMirrorIndents/>
   <w:OverrideTableStyleHps/>
  </w:Compatibility>
  <m:mathPr>
   <m:mathFont m:val="Cambria Math"/>
   <m:brkBin m:val="before"/>
   <m:brkBinSub m:val="&#45;-"/>
   <m:smallFrac m:val="off"/>
   <m:dispDef/>
   <m:lMargin m:val="0"/>
   <m:rMargin m:val="0"/>
   <m:defJc m:val="centerGroup"/>
   <m:wrapIndent m:val="1440"/>
   <m:intLim m:val="subSup"/>
   <m:naryLim m:val="undOvr"/>
  </m:mathPr></w:WordDocument>
</xml><![endif]--><!--[if gte mso 9]><xml>
 <w:LatentStyles DefLockedState="false" DefUnhideWhenUsed="false"
  DefSemiHidden="false" DefQFormat="false" DefPriority="99"
  LatentStyleCount="371">
  <w:LsdException Locked="false" Priority="0" QFormat="true" Name="Normal"/>
  <w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 1"/>
  <w:LsdException Locked="false" Priority="9" SemiHidden="true"
   UnhideWhenUsed="true" QFormat="true" Name="heading 2"/>
  <w:LsdException Locked="false" Priority="9" SemiHidden="true"
   UnhideWhenUsed="true" QFormat="true" Name="heading 3"/>
  <w:LsdException Locked="false" Priority="9" SemiHidden="true"
   UnhideWhenUsed="true" QFormat="true" Name="heading 4"/>
  <w:LsdException Locked="false" Priority="9" SemiHidden="true"
   UnhideWhenUsed="true" QFormat="true" Name="heading 5"/>
  <w:LsdException Locked="false" Priority="9" SemiHidden="true"
   UnhideWhenUsed="true" QFormat="true" Name="heading 6"/>
  <w:LsdException Locked="false" Priority="9" SemiHidden="true"
   UnhideWhenUsed="true" QFormat="true" Name="heading 7"/>
  <w:LsdException Locked="false" Priority="9" SemiHidden="true"
   UnhideWhenUsed="true" QFormat="true" Name="heading 8"/>
  <w:LsdException Locked="false" Priority="9" SemiHidden="true"
   UnhideWhenUsed="true" QFormat="true" Name="heading 9"/>
  <w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
   Name="index 1"/>
  <w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
   Name="index 2"/>
  <w:LsdException Locked="false" SemiHidden="true" UnhideWhenUsed="true"
   Name="index 3"/>

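If anyone can help me, I would be very grateful. Thank you for your time! Note: I am self-taught and relatively new to Python, so I apologize in advance for code that may not look very good.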

0 answers:

No answers