I am using BeautifulSoup to parse data out of HTML tables. Usually the HTML looks like this:
<tr><td width="35%"style="font-style: italic;">Owner:</td><td>MS 'CEMSOL' Schifffahrtsgesellschaft mbH & Co. KG</td></tr>
<tr><td width="35%"style="font-style: italic;">Connecting District:</td><td>HAMBURG (HBR)</td></tr>
<tr><td width="35%"style="font-style: italic;">Flag:</td><td>CYPRUS</td></tr>
<tr><td width="35%"style="font-style: italic;">Port of Registry:</td><td>LIMASSOL</td></tr>
</tbody></table>
But there are also parts like this:
<table class="table1"><thead><tr><th style="width: 140px" class="veristarTableUHeader">Classification</th><th class="top"><a href="#top">Top</a></th></tr></thead><tbody><tr><td width="35%" valign="top"style="font-style: italic;">Main Class Symbols:</td><td>I<img src='/asms2-portlet/images/particulars/croixsoul.gif'/> Hull <img src='/asms2-portlet/images/particulars/croixsoul.gif'/>Mach</td></tr>
<tr><td width="35%" valign="top"style="font-style: italic;">Service Notations:</td><td valign="top">
<table class="empty">
<tr>
<td>General cargo ship /cement carrier</td>
</tr>
</table>
</td></tr>
<tr><td width="35%" valign="top"style="font-style: italic;">Navigation Notations:</td><td>Unrestricted navigation<br></td></tr>
<tr><td width="35%" valign="top"style="font-style: italic;">Additional Class Notation(s):</td><td><img src='/asms2-portlet/images/particulars/croixsoul.gif'/> AUT-UMS , <img src='/asms2-portlet/images/particulars/croixsoul.gif'/> ICE CLASS IA</td></tr>
<tr><td width="35%" valign="top"style="font-style: italic;">Machinery:</td><td valign="top">
<table class="empty">
<tr>
<td width="20"><img src='/asms2-portlet/images/particulars/croixsoul.gif'/></td>
<td>MACH</td>
</tr>
</table>
</td></tr>
</tbody></table>
Source: ShipData.txt
The problem is that there is an extra <tr> tag: the nested table inside the row duplicates one of the cells, so the value "General cargo ship /cement carrier" gets appended to the value list twice, because there is a whole new table inside that <tr>. The result looks like this:
['12536D', '9180401', 'CEMSOL', 'C4WH2', 'General cargo ship', "MS 'CEMSOL' Schifffahrtsgesellschaft mbH & Co. KG", 'HAMBURG (HBR)', 'CYPRUS', 'LIMASSOL', 'I Hull\xc2\xa0\xc2\xa0\xc2\xa0Mach', 'General cargo ship /cement carrier', 'General cargo ship /cement carrier', 'Unrestricted navigation', ' AUT-UMS , ICE CLASS IA', 'MACH']
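The duplication can be reproduced in isolation: find_all is recursive by default, so it also descends into the nested table with class "empty". A minimal sketch (with the markup trimmed down for illustration) shows that asking for direct children only, via recursive=False, skips the inner table's row:

```python
from bs4 import BeautifulSoup

# Trimmed-down version of the markup above: a row whose value cell
# contains a whole nested table.
html = """<table class="table1"><tbody>
<tr><td>Service Notations:</td><td>
<table class="empty"><tr><td>General cargo ship /cement carrier</td></tr></table>
</td></tr>
</tbody></table>"""

soup = BeautifulSoup(html, "html.parser")
tbody = soup.find("tbody")

# Recursive search also finds the nested table's row.
print(len(tbody.find_all("tr")))                   # 2
# Direct children only: the nested table's row is skipped.
print(len(tbody.find_all("tr", recursive=False)))  # 1
```

The same idea applies one level down: tr.find_all('td', recursive=False) keeps the outer cell (whose .text already includes the nested table's text) without returning the inner <td> as a second, separate cell.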
My code is as follows:
# -*- coding: utf-8 -*-
import csv
# from urllib import urlopen
import urllib2
from bs4 import BeautifulSoup
import re
import time
import socket

fm = open('ShipURL.txt', 'r')
Shiplinks = fm.readlines()  # Reads the ship URLs, one per line.
check = ""
columnHeaders = ""
with open('ShipData.csv', 'wb') as f:  # Creates an empty CSV file to write the values into.
    writer = csv.writer(f)
    for line in Shiplinks:
        time.sleep(2)
        website = re.findall(r'(https?://\S+)', line)
        website = "".join(str(x) for x in website)
        print website
        if check != website:
            if website != "":
                check = website
                shipUrl = website
                while True:
                    try:
                        shipPage = urllib2.urlopen(shipUrl, timeout=1.5)
                        break  # Fetched successfully, stop retrying.
                    except urllib2.URLError:
                        continue
                    except socket.timeout:
                        print "socket timeout!"
                        break
                soup = BeautifulSoup(shipPage, "html.parser")  # Parse the web page HTML.
                table = soup.find_all("table", {"class": "table1"})  # Finds all tables with class table1.
                List = []
                columnRow = ""
                valueRow = ""
                Values = []
                for mytable in table:  # Loops over the tables with class table1.
                    table_body = mytable.find('tbody')  # Finds the tbody section in the table.
                    try:  # If tbody exists
                        rows = table_body.find_all('tr')  # Finds all rows.
                        for tr in rows:  # Loops over the rows.
                            cols = tr.find_all('td')  # Finds the columns.
                            i = 1  # Variable to track the column position.
                            for td in cols:  # Loops over the columns.
                                ## print td.text  # Displays the output.
                                co = td.text  # Saves the cell text to a variable.
                                ## writer.writerow([co])  # Writes the variable into a CSV row.
                                if i == 1:  # Checks the control variable; if it equals 1..
                                    if td.text[-1] == ":":  # ..and the cell ends with ":",
                                        columnRow += td.text.strip(":") + " , "  # One string of headers.
                                        List.append(td.text.encode("utf-8"))  # ..adds the header cell to the list 'List' and..
                                        i += 1  # ..increments i by one.
                                else:
                                    # Strips the line breaks and appends a comma to the string.
                                    valueRow += td.text.strip("\n") + " , "
                                    Values.append(td.text.strip("\n").encode("utf-8"))  # Adds the second column's value to the list 'Values'.
                                # print List  # Checking stuff
                                # print Values  # Checking stuff
                    except:
                        print "no tbody"
                # Prints headings and values, one row each.
                print columnRow.strip(",")
                print "\n"
                print valueRow.strip(",")
                # The encoding was acting up again.
                # Writes the column headers as the first row and the values as the second.
                if not columnHeaders:
                    writer.writerow(List)
                    columnHeaders = columnRow
                writer.writerow(Values)
fm.close()
Answer 0 (score: 0)
I know this doesn't directly solve the extra table tag, but it sounds like you just need a way to avoid ending up with a value twice in your list!
If I were you, I would use a set, so each value can appear in it only once.
To create an empty set (note that {} creates an empty dict, not a set, so use the set() constructor):
List = set()
Values = set()
I would also rename them to make the type obvious:
Set = set()
Set_values = set()
After you do this, you can change the last bit of your code to fix the problem!
if td.text[-1] == ":":
    columnRow += td.text.strip(":") + " , "
    List.append(td.text.encode("utf-8"))
    i += 1
else:
    valueRow += td.text.strip("\n") + " , "
    Values.append(td.text.strip("\n").encode("utf-8"))
would become:
if td.text[-1] == ":":
    columnRow += td.text.strip(":") + " , "
    Set.add(td.text.encode("utf-8"))  # <--- Here is the change
    i += 1
else:
    valueRow += td.text.strip("\n") + " , "
    Set_values.add(td.text.strip("\n").encode("utf-8"))  # <--- Here is another change
With this change, each value will appear only once when you write the CSV.
If your CSV code works better with lists than with sets, you can convert the sets to lists at the end of the file:
my_list = list(Set)
my_list2 = list(Set_values)
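One caveat with this approach: a set discards insertion order, which matters if the CSV values must stay aligned with their headers, and list(set(...)) may come back shuffled. A small stdlib-only sketch of an order-preserving alternative, using hypothetical sample data:

```python
def dedupe_keep_order(items):
    # Keep only the first occurrence of each item, preserving order,
    # unlike list(set(items)) which may reorder the values.
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out

values = ['General cargo ship /cement carrier',
          'General cargo ship /cement carrier',
          'Unrestricted navigation']
print(dedupe_keep_order(values))
# ['General cargo ship /cement carrier', 'Unrestricted navigation']
```

Note that de-duplicating by value also drops legitimately repeated values, so this only works if no two different fields ever share exactly the same text.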
Answer 1 (score: 0)
Solved the problem with an if clause: set t = 0, and append the td.text element to the list only if t < 1. This guarantees that only one element is added per row.
# -*- coding: utf-8 -*-
import csv
# from urllib import urlopen
import urllib2
from bs4 import BeautifulSoup
import re
import time
import socket

fm = open('ShipURL.txt', 'r')
Shiplinks = fm.readlines()  # Reads the ship URLs, one per line.
check = ""
columnHeaders = ""
with open('ShipData.csv', 'wb') as f:  # Creates an empty CSV file to write the values into.
    writer = csv.writer(f)
    for line in Shiplinks:
        time.sleep(2)
        website = re.findall(r'(https?://\S+)', line)
        website = "".join(str(x) for x in website)
        print website
        if check != website:
            if website != "":
                check = website
                shipUrl = website
                while True:
                    try:
                        shipPage = urllib2.urlopen(shipUrl, timeout=1.5)
                        break  # Fetched successfully, stop retrying.
                    except urllib2.URLError:
                        continue
                    except socket.timeout:
                        print "socket timeout!"
                        break
                soup = BeautifulSoup(shipPage, "html.parser")  # Parse the web page HTML.
                table = soup.find_all("table", {"class": "table1"})  # Finds all tables with class table1.
                List = []
                columnRow = ""
                valueRow = ""
                Values = []
                for mytable in table:  # Loops over the tables with class table1.
                    table_body = mytable.find('tbody')  # Finds the tbody section in the table.
                    try:  # If tbody exists
                        rows = table_body.find_all('tr')  # Finds all rows.
                        for tr in rows:  # Loops over the rows.
                            cols = tr.find_all('td')  # Finds the columns.
                            i = 1  # Variable to track the column position.
                            t = 0  # Counts value cells already taken from this row.
                            for td in cols:  # Loops over the columns.
                                ## print td.text  # Displays the output.
                                co = td.text  # Saves the cell text to a variable.
                                ## writer.writerow([co])  # Writes the variable into a CSV row.
                                if i == 1:  # Checks the control variable; if it equals 1..
                                    if td.text[-1] == ":":  # ..and the cell ends with ":",
                                        columnRow += td.text.strip(":") + " , "  # One string of headers.
                                        List.append(td.text.encode("utf-8"))  # ..adds the header cell to the list 'List' and..
                                        i += 1  # ..increments i by one.
                                else:
                                    # Strips the line breaks and appends a comma; only the
                                    # first value cell of the row is kept.
                                    if t < 1:
                                        valueRow += td.text.strip("\n") + " , "
                                        Values.append(td.text.strip("\n").encode("utf-8"))  # Adds the second column's value to the list 'Values'.
                                        t += 1
                                # print List  # Checking stuff
                                # print Values  # Checking stuff
                    except:
                        print "no tbody"
                # Prints headings and values, one row each.
                print columnRow.strip(",")
                print "\n"
                print valueRow.strip(",")
                # The encoding was acting up again.
                # Writes the column headers as the first row and the values as the second.
                if not columnHeaders:
                    writer.writerow(List)
                    columnHeaders = columnRow
                writer.writerow(Values)
fm.close()
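Stripped of the scraping, the t < 1 guard reduces to taking only the first value cell of each row. A self-contained sketch with hypothetical row data (lists of cell texts standing in for the parsed <td> elements):

```python
# Each row: the label cell first, then one or more value cells.
# The nested table produces the spurious repeated value cell.
rows = [
    ["Service Notations:", "General cargo ship /cement carrier",
     "General cargo ship /cement carrier"],   # duplicated by the nested table
    ["Navigation Notations:", "Unrestricted navigation"],
]

values = []
for cells in rows:
    t = 0                   # value cells taken from this row so far
    for cell in cells[1:]:  # skip the label column
        if t < 1:           # keep only the first value cell
            values.append(cell)
            t += 1

print(values)
# ['General cargo ship /cement carrier', 'Unrestricted navigation']
```

This silently drops any genuine second value cell as well, which is fine for this page layout where each row carries exactly one value.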