Parsing HTML tables with BeautifulSoup

Date: 2016-02-02 09:43:58

Tags: python beautifulsoup html-parsing

I am using BeautifulSoup to parse data out of HTML tables. Usually the HTML looks like this:

        <tr><td width="35%"style="font-style: italic;">Owner:</td><td>MS 'CEMSOL' Schifffahrtsgesellschaft mbH & Co. KG</td></tr>
        <tr><td width="35%"style="font-style: italic;">Connecting District:</td><td>HAMBURG (HBR)</td></tr>
        <tr><td width="35%"style="font-style: italic;">Flag:</td><td>CYPRUS</td></tr>
        <tr><td width="35%"style="font-style: italic;">Port of Registry:</td><td>LIMASSOL</td></tr>
    </tbody></table>
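For rows like these, the usual extraction is straightforward: every row is one label/value pair. A minimal sketch (the sample HTML is reduced to the parts that matter here):

```python
from bs4 import BeautifulSoup

# Reduced sample of the simple case: one label td and one value td per row.
html = """<table><tbody>
<tr><td style="font-style: italic;">Flag:</td><td>CYPRUS</td></tr>
<tr><td style="font-style: italic;">Port of Registry:</td><td>LIMASSOL</td></tr>
</tbody></table>"""

soup = BeautifulSoup(html, "html.parser")
pairs = {}
for tr in soup.find_all("tr"):
    # Each row yields exactly two cells: the label (ending in ':') and the value.
    label, value = [td.get_text(strip=True) for td in tr.find_all("td")]
    pairs[label.rstrip(":")] = value

print(pairs)
```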

But some sections look like this:

<table class="table1"><thead><tr><th style="width: 140px" class="veristarTableUHeader">Classification</th><th class="top"><a href="#top">Top</a></th></tr></thead><tbody><tr><td width="35%" valign="top"style="font-style: italic;">Main Class Symbols:</td><td>I<img src='/asms2-portlet/images/particulars/croixsoul.gif'/> Hull&nbsp;&nbsp;&nbsp;<img src='/asms2-portlet/images/particulars/croixsoul.gif'/>Mach</td></tr>
    <tr><td width="35%" valign="top"style="font-style: italic;">Service Notations:</td><td valign="top">
        <table class="empty">
            <tr>

                <td>General cargo ship /cement carrier</td>
            </tr>
        </table>
    </td></tr>
    <tr><td width="35%" valign="top"style="font-style: italic;">Navigation Notations:</td><td>Unrestricted navigation<br></td></tr>
    <tr><td width="35%" valign="top"style="font-style: italic;">Additional Class Notation(s):</td><td><img src='/asms2-portlet/images/particulars/croixsoul.gif'/> AUT-UMS , <img src='/asms2-portlet/images/particulars/croixsoul.gif'/> ICE CLASS IA</td></tr>
    <tr><td width="35%" valign="top"style="font-style: italic;">Machinery:</td><td valign="top">
        <table class="empty">
            <tr>
                <td width="20"><img src='/asms2-portlet/images/particulars/croixsoul.gif'/></td>
                <td>MACH</td>
            </tr>
        </table>
    </td></tr>

</tbody></table>

Source: ShipData.txt. The problem is the extra `<tr>` tag: because a new table is nested inside the row, the value "General cargo ship /cement carrier" is picked up twice and added to the list of values two times:

['12536D', '9180401', 'CEMSOL', 'C4WH2', 'General cargo ship', "MS 'CEMSOL' Schifffahrtsgesellschaft mbH & Co. KG", 'HAMBURG (HBR)', 'CYPRUS', 'LIMASSOL', 'I Hull\xc2\xa0\xc2\xa0\xc2\xa0Mach', 'General cargo ship /cement carrier', 'General cargo ship /cement carrier', 'Unrestricted navigation', ' AUT-UMS , ICE CLASS IA', 'MACH']
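The duplication happens because `find_all` descends into the nested `<table class="empty">`: the inner `<tr>` and `<td>` are returned alongside the outer ones. One way around it is to restrict both searches to direct children with `recursive=False`. A minimal sketch, with the sample HTML cut down from the snippet above:

```python
from bs4 import BeautifulSoup

html = """<table class="table1"><tbody>
<tr><td style="font-style: italic;">Service Notations:</td><td>
    <table class="empty"><tr><td>General cargo ship /cement carrier</td></tr></table>
</td></tr>
</tbody></table>"""

soup = BeautifulSoup(html, "html.parser")
tbody = soup.find("table", {"class": "table1"}).find("tbody")

# recursive=False limits find_all to direct children, so the rows and cells
# belonging to the nested <table class="empty"> are not collected a second time.
# get_text() on the outer value td still pulls in the nested table's text once.
for tr in tbody.find_all("tr", recursive=False):
    cells = [td.get_text(" ", strip=True) for td in tr.find_all("td", recursive=False)]
    print(cells)
```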

My code is as follows:

# -*- coding: utf-8 -*-

import csv
# from urllib import urlopen
import urllib2
from bs4 import BeautifulSoup
import re
import time
import socket
fm = open('ShipURL.txt', 'r')
Shiplinks = fm.readlines()  # Reads all the URL lines from the file.
check = ""
columnHeaders = ""
with open('ShipData.csv', 'wb') as f:  # Creates an empty CSV file to write the values into.
    writer = csv.writer(f)
    for line in Shiplinks:
        time.sleep(2)
        website = re.findall(r'(https?://\S+)', line)
        website = "".join(str(x) for x in website)
        print website
        if check != website:
            if website != "":
                check = website
                shipUrl = website
                while True:
                    try:
                        shipPage = urllib2.urlopen(shipUrl, timeout=1.5)
                    except urllib2.URLError:
                        continue
                    except socket.timeout:
                        print "socket timeout!"
                    break
                soup = BeautifulSoup(shipPage, "html.parser")  # Read the web page HTML
                table = soup.find_all("table", {"class": "table1"})  # Finds table with class table1
                List = []
                columnRow = ""
                valueRow = ""
                Values = []
                for mytable in table:                                   #Loops tables with class table1
                    table_body = mytable.find('tbody')                  #Finds tbody section in table
                    try:                                                #If tbody exists
                        rows = table_body.find_all('tr')                #Finds all rows
                        for tr in rows:                                 #Loops rows
                            cols = tr.find_all('td')                    #Finds the columns
                            i = 1                                       #Variable to control the lines
                            for td in cols:                             #Loops the columns
            ##                  print td.text                           #Displays the output
                                co = td.text                            #Saves the column to a variable
            ##                    writer.writerow([co])                 Writes the variable in CSV file row
                                if i == 1:                              #Checks the control variable, if it equals to 1
                                    if td.text[-1] == ":":              # - : adds a ,
                                        columnRow += td.text.strip(":") + " , "  # One string
                                        List.append(td.text.encode("utf-8"))                #.. takes the column value and assigns it to a list called 'List' and..
                                        i += 1                                #..Increments i by one

                                else:
                                    # strips the line breaks and appends a comma to the string
                                    valueRow += td.text.strip("\n") + " , "
                                    Values.append(td.text.strip("\n").encode("utf-8"))              #Takes the second columns value and assigns it to a list called Values
                                #print List                             #Checking stuff
                                #print Values                           #Checking stuff
                    except:
                        print "no tbody"
                # Prints headings and values with row change.
                print columnRow.strip(",")
                print "\n"
                print valueRow.strip(",")
                # encoding was causing trouble again
                # Writes the column headers as the first row and the values as the second
                if not columnHeaders:
                    writer.writerow(List)
                columnHeaders = columnRow
                writer.writerow(Values)
                #
fm.close()

2 Answers:

Answer 0 (score: 0):

I know this doesn't directly fix the extra table tag, but it sounds like you just need a solution so you don't get the same value twice in your list!

If I were you, I would use a set, so each value can only appear once in your collection.

To create an empty set (note that `{}` creates an empty dict, not a set, so use the `set()` constructor):

List = set()
Values = set()

Note: `set()` also works on versions before 2.7; only the non-empty set literal syntax like `{1, 2}` requires Python 2.7 or later.

I would rename them to:

Set        = set()
Set_values = set()

Once you've done that, you can change the last bit of your code to fix the problem!

if td.text[-1] == ":":                        
     columnRow += td.text.strip(":") + " , "  
     List.append(td.text.encode("utf-8"))     
     i += 1  
else:
   valueRow += td.text.strip("\n") + " , "
   Values.append(td.text.strip("\n").encode("utf-8"))

I would use:

if td.text[-1] == ":":                        
     columnRow += td.text.strip(":") + " , "  
     Set.add(td.text.encode("utf-8"))   #<---Here is the change  
     i += 1 
else:
    valueRow += td.text.strip("\n") + " , "
    Set_values.add(td.text.strip("\n").encode("utf-8")) #<---Here is another change

Doing this means each value appears only once when you write to the CSV.

If your CSV writer works better with lists than with sets, you can convert the sets back to lists at the end of the file by doing this:

my_list  = list(Set)
my_list2 = list(Set_values)
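One caveat with this approach: sets also throw away insertion order, which matters here because the values are meant to line up with the column headers in the CSV. A sketch of an order-preserving alternative (the `dedupe` helper is my own name, not from the original code) that drops duplicates while keeping the first occurrence of each value in place:

```python
def dedupe(items):
    # Keep only the first occurrence of each value, preserving order.
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result

values = ['General cargo ship /cement carrier',
          'General cargo ship /cement carrier',
          'Unrestricted navigation']
print(dedupe(values))
```

Note that this also removes legitimately repeated values, so it only suits data where every field is expected to be distinct.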

Answer 1 (score: 0):

I solved the problem with an if clause: set t = 0 at the start of each row, and only while t < 1 does the program append the td.text element to the values list, incrementing t afterwards. This guarantees that only one value is added per row.

# -*- coding: utf-8 -*-

import csv
# from urllib import urlopen
import urllib2
from bs4 import BeautifulSoup
import re
import time
import socket
fm = open('ShipURL.txt', 'r')
Shiplinks = fm.readlines()  # Reads all the URL lines from the file.
check = ""
columnHeaders = ""
with open('ShipData.csv', 'wb') as f:  # Creates an empty CSV file to write the values into.
    writer = csv.writer(f)
    for line in Shiplinks:
        time.sleep(2)
        website = re.findall(r'(https?://\S+)', line)
        website = "".join(str(x) for x in website)
        print website
        if check != website:
            if website != "":
                check = website
                shipUrl = website
                while True:
                    try:
                        shipPage = urllib2.urlopen(shipUrl, timeout=1.5)
                    except urllib2.URLError:
                        continue
                    except socket.timeout:
                        print "socket timeout!"
                    break
                soup = BeautifulSoup(shipPage, "html.parser")  # Read the web page HTML
                table = soup.find_all("table", {"class": "table1"})  # Finds table with class table1
                List = []
                columnRow = ""
                valueRow = ""
                Values = []
                for mytable in table:                                   #Loops tables with class table1
                    table_body = mytable.find('tbody')                  #Finds tbody section in table
                    try:                                                #If tbody exists
                        rows = table_body.find_all('tr')                #Finds all rows
                        for tr in rows:                                 #Loops rows
                            cols = tr.find_all('td')                    #Finds the columns
                            i = 1                                       #Variable to control the lines
                            t = 0
                            for td in cols:                             #Loops the columns
            ##                  print td.text                           #Displays the output
                                co = td.text                           #Saves the column to a variable
            ##                    writer.writerow([co])                 Writes the variable in CSV file row
                                if i == 1:                              #Checks the control variable, if it equals to 1
                                    if td.text[-1] == ":":              # - : adds a ,
                                        columnRow += td.text.strip(":") + " , "  # One string
                                        List.append(td.text.encode("utf-8"))                #.. takes the column value and assigns it to a list called 'List' and..
                                        i += 1                                #..Increments i by one

                                else:
                                    # strips the line breaks and appends a comma to the string
                                    if t < 1:
                                        valueRow += td.text.strip("\n") + " , "
                                        Values.append(td.text.strip("\n").encode("utf-8"))              #Takes the second columns value and assigns it to a list called Values
                                        t += 1
                                #print List                             #Checking stuff
                                #print Values                           #Checking stuff
                    except:
                        print "no tbody"
                # Prints headings and values with row change.
                print columnRow.strip(",")
                print "\n"
                print valueRow.strip(",")
                # encoding was causing trouble again
                # Writes the column headers as the first row and the values as the second
                if not columnHeaders:
                    writer.writerow(List)
                columnHeaders = columnRow
                writer.writerow(Values)
                #
fm.close()