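I am trying to parse tables with BeautifulSoup. The first table on my page was easy, but I cannot parse a similar table on the same page, and I don't understand why.
Here is the code. Thanks in advance for your help.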
import urllib2
from bs4 import BeautifulSoup

url = urllib2.urlopen("https://dl.dropboxusercontent.com/u/956261/poftext.html")
contentHTML = url.read()
soup = BeautifulSoup(contentHTML)

tableUserDetails = soup.find("table", {"class" : "user-details"})
i = 0
tableUserDetailsList = []
for row in tableUserDetails.findAll('tr'):
    for col in row.findAll('td'):
        contentTd = col.contents[0].string.strip()
        if contentTd:
            print "TD Number %d : %s" % (i, contentTd)
            tableUserDetailsList.append(contentTd)
            i += 1
# This first table is OK
print tableUserDetailsList

# But now this one
tableUserDetails = soup.find("table", {"class" : "secondpart"})
i = 0
tableUserDetailsList = []
for row in tableUserDetails.findAll('tr'):
    for col in row.findAll('td'):
        contentTd = col.contents[0].string.strip()
        if contentTd:
            print "TD Number %d : %s" % (i, contentTd)
            tableUserDetailsList.append(contentTd)
            i += 1
print tableUserDetailsList
# The list is empty :(
Here is a simplified version of the HTML I want to parse:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>
French.Kiss
Sorties, Sport, Voyages, Nouvelles Expériences</title>
</head>
<body style='background-color: #fff;' leftMargin='0' topMargin='0' marginwidth='0' marginheight='0' link='#1E55D6' vlink='#1E55D6' TEXT='#6551b0'>
<table class="user-details">
<tr>
<td class="headline txtBlue size15" style="width:80px">
About
</td>
<td style="width:10px">
</td>
<td class="txtGrey size15">
Fume occasionnellement with Silhouette mince
</td>
<td width="25px;">
</td>
<td class="headline txtBlue size15">
City
</td>
<td class="txtGrey size15">
Paris Ile-de-France
</td>
</tr>
<tr>
<td class="headline txtBlue size15">
Details
</td>
<td style="width:10px">
</td>
<td class="txtGrey size15">
26 year old Un homme, 185cm, Sans religion
</td>
<td>
</td>
<td class="headline txtBlue size15">
Ethnicity
</td>
<td class="txtGrey size15">
Caucasienne Balance with Châtains
</td>
</tr>
<tr>
<td class="headline txtBlue size15">
Intent
</td>
<td style="width:10px">
</td>
<td class="txtGrey size15">
French.Kiss Cherche une relation amoureuse.
</td>
<td>
</td>
<td class="headline txtBlue size15" style="width:90px">
Education
</td>
<td class="txtGrey size15">
Diplôme universitaire/Licence
</td>
</tr>
<tr>
<td class="headline txtBlue size15">
Personnalité
</td>
<td style="width:10px">
</td>
<td class="txtGrey size15">
</td> <td>
</td>
<td>
<span class="headline txtBlue size15">Profession </span>
</td>
<td>
<span class="txtGrey size15">
Visioconférence</span>
</td>
</tr>
</table>
<table width="85%" class="secondpart">
<tr height="25px">
<td width="200px">
<span class="headline txtBlue size14">I am Seeking a</span>
</td>
<td width="300px">
<span class="txtGrey size14">
Une femme</span>
</td>
<td width="25px">
</td>
<td width="200px">
<span class="headline txtBlue size14">For</span>
</td>
<td width="200px">
<span class="txtGrey size14">
Sorties</span>
</td>
</tr>
<tr height="25px">
<td>
<span class="headline txtBlue size14"><a href='needs_test.aspx'>Needs Test</a></span>
</td>
<td>
<span class="txtGrey size14"><a href='needs_test.aspx'>
<a href="needs_view.aspx?id=38028200">View
his
relationship needs</a></a></span>
</td>
<td>
</td>
<td>
<span class="headline txtBlue size14"><a href='poftest.aspx'>Chemistry</a></span>
</td>
<td>
<span class="txtGrey size14"><a href='poftest.aspx'>
<a href="personality.aspx?id=26&user_id=41724176" rel="nofollow">View
his
chemistry results</a></a></span>
</td>
</tr>
<tr height="25px">
<td>
<span class="headline txtBlue size14">Do you drink?</span>
</td>
<td>
<span class="txtGrey size14">
Occasionnellement</span>
</td>
<td>
</td>
<td>
<span class="headline txtBlue size14">Do you want children?</span>
</td>
<td>
<span class="txtGrey size14">
Non divulgué</span>
</td>
</tr>
<tr height="25px">
<td>
<span class="headline txtBlue size14">Marital Status</span>
</td>
<td>
<span class="txtGrey size14">
Célibataire</span>
</td>
<td>
</td>
<td>
<span class="headline txtBlue size14">Do you do drugs?</span>
</td>
<td>
<span class="txtGrey size14">
Non</span>
</td>
</tr>
<tr height="25px">
<td>
<span class="headline txtBlue size14">Pets </span>
</td>
<td>
<span class="txtGrey size14">
Aucun</span>
</td>
<td>
</td>
<td>
<span class="headline txtBlue size14">Eye Color</span>
</td>
<td>
<span class="txtGrey size14">
Bruns</span>
</td>
</tr>
<tr height="25px">
<td>
<span class="headline txtBlue size14">Do you have a car? </span>
</td>
<td>
<span class="txtGrey size14">
N/A</span>
</td>
<td>
</td>
<td>
<span class="headline txtBlue size14">Do you have children?</span>
</td>
<td>
<span class="txtGrey size14">
Non</span>
</td>
</tr>
<tr height="25px">
<td>
<span class="headline txtBlue size14">Longest Relationship</span>
</td>
<td>
<span class="txtGrey size14">
Plus de 2 ans</span>
</td>
<td>
</td>
<td>
</td>
<td>
</td>
</tr>
</table>
</body>
</html>
Here are tableUserDetails.content, tableUserDetails and tableUserDetailsList for both tables:
* FIRST TABLE *
print tableUserDetails.content = none
print tableUserDetails =
<table class="user-details">
<tr>
<td class="headline txtBlue size15" style="width:80px">
About
</td>
<td style="width:10px">
</td>
<td class="txtGrey size15">
Fume occasionnellement with Silhouette mince
</td>
<td width="25px;">
</td>
<td class="headline txtBlue size15">
City
</td>
<td class="txtGrey size15">
Paris Ile-de-France
</td>
</tr>
<tr>
<td class="headline txtBlue size15">
Details
</td>
<td style="width:10px">
</td>
<td class="txtGrey size15">
26 year old Un homme, 185cm, Sans religion
</td>
<td>
</td>
<td class="headline txtBlue size15">
Ethnicity
</td>
<td class="txtGrey size15">
Caucasienne Balance with Châtains
</td>
</tr>
<tr>
<td class="headline txtBlue size15">
Intent
</td>
<td style="width:10px">
</td>
<td class="txtGrey size15">
French.Kiss Cherche une relation amoureuse.
</td>
<td>
</td>
<td class="headline txtBlue size15" style="width:90px">
Education
</td>
<td class="txtGrey size15">
Diplôme universitaire/Licence
</td>
</tr>
<tr>
<td class="headline txtBlue size15">
Personnalité
</td>
<td style="width:10px">
</td>
<td class="txtGrey size15">
</td> <td>
</td>
<td>
<span class="headline txtBlue size15">Profession </span>
</td>
<td>
<span class="txtGrey size15">
Visioconférence</span>
</td>
</tr>
</table>
print tableUserDetailsList = [u'About', u'Fume occasionnellement with Silhouette mince', u'City', u'Paris Ile-de-France', u'Details', u'26 year old Un homme, 185cm, Sans religion', u'Ethnicity', u'Caucasienne Balance with Ch\xe2tains', u'Intent', u'French.Kiss Cherche une relation amoureuse.', u'Education', u'Dipl\xf4me universitaire/Licence', u'Personnalit\xe9']
* SECOND TABLE *
print tableUserDetails.content = none
print tableUserDetails =
<table width="85%" class="secondpart">
<tr height="25px">
<td width="200px">
<span class="headline txtBlue size14">I am Seeking a</span>
</td>
<td width="300px">
<span class="txtGrey size14">
Une femme</span>
</td>
<td width="25px">
</td>
<td width="200px">
<span class="headline txtBlue size14">For</span>
</td>
<td width="200px">
<span class="txtGrey size14">
Sorties</span>
</td>
</tr>
<tr height="25px">
<td>
<span class="headline txtBlue size14"><a href='needs_test.aspx'>Needs Test</a></span>
</td>
<td>
<span class="txtGrey size14"><a href='needs_test.aspx'>
<a href="needs_view.aspx?id=38028200">View
his
relationship needs</a></a></span>
</td>
<td>
</td>
<td>
<span class="headline txtBlue size14"><a href='poftest.aspx'>Chemistry</a></span>
</td>
<td>
<span class="txtGrey size14"><a href='poftest.aspx'>
<a href="personality.aspx?id=26&user_id=41724176" rel="nofollow">View
his
chemistry results</a></a></span>
</td>
</tr>
<tr height="25px">
<td>
<span class="headline txtBlue size14">Do you drink?</span>
</td>
<td>
<span class="txtGrey size14">
Occasionnellement</span>
</td>
<td>
</td>
<td>
<span class="headline txtBlue size14">Do you want children?</span>
</td>
<td>
<span class="txtGrey size14">
Non divulgué</span>
</td>
</tr>
<tr height="25px">
<td>
<span class="headline txtBlue size14">Marital Status</span>
</td>
<td>
<span class="txtGrey size14">
Célibataire</span>
</td>
<td>
</td>
<td>
<span class="headline txtBlue size14">Do you do drugs?</span>
</td>
<td>
<span class="txtGrey size14">
Non</span>
</td>
</tr>
<tr height="25px">
<td>
<span class="headline txtBlue size14">Pets </span>
</td>
<td>
<span class="txtGrey size14">
Aucun</span>
</td>
<td>
</td>
<td>
<span class="headline txtBlue size14">Eye Color</span>
</td>
<td>
<span class="txtGrey size14">
Bruns</span>
</td>
</tr>
<tr height="25px">
<td>
<span class="headline txtBlue size14">Do you have a car? </span>
</td>
<td>
<span class="txtGrey size14">
N/A</span>
</td>
<td>
</td>
<td>
<span class="headline txtBlue size14">Do you have children?</span>
</td>
<td>
<span class="txtGrey size14">
Non</span>
</td>
</tr>
<tr height="25px">
<td>
<span class="headline txtBlue size14">Longest Relationship</span>
</td>
<td>
<span class="txtGrey size14">
Plus de 2 ans</span>
</td>
<td>
</td>
<td>
</td>
<td>
</td>
</tr>
</table>
print tableUserDetailsList = []
Answer 0 (score: 1)
This works:
i = 0
tableUserDetailsList = []
for row in tableUserDetails.findAll('tr'):
    for col in row.findAll('td'):
        # stripped_strings skips whitespace-only strings and looks inside child tags
        contents = list(col.stripped_strings)
        if contents:
            contentTd = contents[0]
            print "TD Number %d : %s" % (i, contentTd)
            tableUserDetailsList.append(contentTd)
            i += 1
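The problem is that your second table contains spans. The newline before each span is also treated as content and is returned as the first item of the col.contents list, so col.contents[0].string.strip() yields an empty string and the cell is skipped.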
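To make the difference visible, here is a minimal sketch that compares the two kinds of cell. It is written for Python 3 (unlike the Python 2 code in the question), the two-cell HTML snippet is a cut-down stand-in for the real page, and the "html.parser" backend is an assumption chosen only to avoid extra dependencies.

```python
from bs4 import BeautifulSoup

# Two cells modelled on the question's HTML: plain text vs. text inside a <span>.
html = """<table><tr>
<td class="txtGrey size15">
Paris Ile-de-France
</td>
<td>
<span class="txtGrey size14">
Sorties</span>
</td>
</tr></table>"""

soup = BeautifulSoup(html, "html.parser")
plain_td, span_td = soup.find_all("td")

# In the plain cell, the first child is the text itself.
first_plain = plain_td.contents[0].string.strip()
# In the span cell, the first child is the newline that precedes <span>.
first_span = span_td.contents[0].string.strip()
# stripped_strings walks all descendants and skips whitespace-only strings.
span_texts = list(span_td.stripped_strings)

print(repr(first_plain))
print(repr(first_span))
print(span_texts)
```

The second value comes out empty, which is why the `if contentTd:` check in the question drops every cell of the second table.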
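The stripped_strings version also works for the first table. As Anubhav commented, you should consider iterating over the tables instead of using the same code twice.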
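One way to avoid the duplication might look like the sketch below. The helper name `table_cells` and the shortened HTML are illustrative assumptions, not part of the original page or code, and the example targets Python 3.

```python
from bs4 import BeautifulSoup


def table_cells(table):
    """Return the first non-blank string of every non-empty <td> in a table."""
    cells = []
    for row in table.find_all("tr"):
        for col in row.find_all("td"):
            strings = list(col.stripped_strings)
            if strings:
                cells.append(strings[0])
    return cells


# Shortened stand-in for the page in the question.
html = """
<table class="user-details"><tr>
<td class="headline txtBlue size15">
City
</td>
<td class="txtGrey size15">
Paris Ile-de-France
</td>
</tr></table>
<table width="85%" class="secondpart"><tr>
<td>
<span class="headline txtBlue size14">For</span>
</td>
<td>
<span class="txtGrey size14">
Sorties</span>
</td>
<td>
</td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
results = {}
for css_class in ("user-details", "secondpart"):
    table = soup.find("table", {"class": css_class})
    results[css_class] = table_cells(table)

print(results)
```

Keeping the extraction in one function means a fix such as the switch to stripped_strings only has to be made once.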
Answer 1 (score: 0)
Instead of using table = soup.find('table')
use table = soup.find_all('table')
This returns a list of the tables in the HTML, and you can then pick the right table from that list.
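As a rough illustration of that suggestion, the sketch below picks a table from the find_all result either by position or by class. The two-table HTML is a simplified assumption standing in for the real page, and the code is written for Python 3.

```python
from bs4 import BeautifulSoup

html = """
<table class="user-details"><tr><td>City</td></tr></table>
<table width="85%" class="secondpart"><tr><td><span>For</span></td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all("table")  # every <table>, in document order

second = tables[1]  # pick by position
# Or pick by class; "class" is multi-valued, so get() returns a list of names.
by_class = [t for t in tables if "secondpart" in t.get("class", [])]

print(len(tables))
print(second["class"])
print(by_class[0] is second)
```

Selecting by class is usually less fragile than selecting by index, since the position changes if another table is added to the page.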