Can't parse the second table with BeautifulSoup even though the first one works?

Time: 2013-05-15 06:14:46

Tags: python parsing html-parsing beautifulsoup

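I am trying to parse tables with BeautifulSoup. The first table on my page is easy, but I cannot parse a similar table on the same page, and I don't understand why.

Here is the code. Thanks in advance for your help.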

import urllib2
from bs4 import BeautifulSoup


url = urllib2.urlopen("https://dl.dropboxusercontent.com/u/956261/poftext.html")
contentHTML = url.read()

soup = BeautifulSoup(contentHTML)

tableUserDetails = soup.find("table", {"class" : "user-details"})

i = 0
tableUserDetailsList = []
for row in tableUserDetails.findAll('tr'):
    for col in row.findAll('td'):
        contentTd = col.contents[0].string.strip()

        if contentTd:
            print "TD Number %d : %s" % (i, contentTd)
            tableUserDetailsList.append(contentTd)
            i += 1

# This first table is OK
print tableUserDetailsList


# But now this one
tableUserDetails = soup.find("table", {"class" : "secondpart"})

i = 0
tableUserDetailsList = []
for row in tableUserDetails.findAll('tr'):
    for col in row.findAll('td'):
        contentTd = col.contents[0].string.strip()

        if contentTd:
            print "TD Number %d : %s" % (i, contentTd)
            tableUserDetailsList.append(contentTd)
            i += 1

print tableUserDetailsList

# The list is empty :(

Here is a simplified version of the HTML I am trying to parse:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>
        French.Kiss
        Sorties, Sport, Voyages, Nouvelles Expériences</title> 

</head>
<body style='background-color: #fff;' leftMargin='0' topMargin='0' marginwidth='0' marginheight='0' link='#1E55D6' vlink='#1E55D6'  TEXT='#6551b0'>

            <table class="user-details">
                <tr>
                    <td class="headline txtBlue size15" style="width:80px">
                        About
                    </td>
                    <td style="width:10px">
                        &nbsp;
                    </td>
                    <td class="txtGrey size15">
                        Fume occasionnellement with Silhouette mince
                    </td>
                    <td width="25px;">
                        &nbsp;
                    </td>
                    <td class="headline txtBlue size15">
                        City
                    </td>
                    <td class="txtGrey size15">
                        Paris Ile-de-France
                    </td>
                </tr>
                <tr>
                    <td class="headline txtBlue size15">
                        Details
                    </td>
                    <td style="width:10px">
                        &nbsp;
                    </td>
                    <td class="txtGrey size15">
                        26 year old Un homme, 185cm, Sans religion
                    </td>
                    <td>
                    </td>
                    <td class="headline txtBlue size15">
                        Ethnicity
                    </td>
                    <td class="txtGrey size15">
                        Caucasienne Balance with Châtains
                    </td>
                </tr>
                <tr>
                    <td class="headline txtBlue size15">
                        Intent
                    </td>
                    <td style="width:10px">
                        &nbsp;
                    </td>
                    <td class="txtGrey size15">
                        French.Kiss Cherche une relation amoureuse.
                    </td>
                    <td>
                    </td>
                    <td class="headline txtBlue size15" style="width:90px">
                        Education
                    </td>
                    <td class="txtGrey size15">
                        Diplôme universitaire/Licence
                    </td>
                </tr>

                <tr>
                    <td class="headline txtBlue size15">
                        Personnalité
                    </td>
                    <td style="width:10px">
                        &nbsp;
                    </td>
                    <td class="txtGrey size15">

                    </td>   <td>
                    </td>
                <td>
                            <span class="headline txtBlue size15">Profession </span>
                        </td>
                        <td>
                            <span class="txtGrey size15">
                                Visioconférence</span>
                        </td>
                </tr>

            </table> 





















                <table width="85%" class="secondpart">
                    <tr height="25px">
                        <td width="200px">
                            <span class="headline txtBlue size14">I am Seeking a</span>
                        </td>
                        <td width="300px">
                            <span class="txtGrey size14">
                                Une femme</span>
                        </td>
                        <td width="25px">
                        </td>
                        <td width="200px">
                            <span class="headline txtBlue size14">For</span>
                        </td>
                        <td width="200px">
                            <span class="txtGrey size14">
                                Sorties</span>
                        </td>
                    </tr>
                    <tr height="25px">
                        <td>
                            <span class="headline txtBlue size14"><a href='needs_test.aspx'>Needs Test</a></span>
                        </td>
                        <td>
                            <span class="txtGrey size14"><a href='needs_test.aspx'>


                                <a href="needs_view.aspx?id=38028200">View
                                    his
                                    relationship needs</a></a></span>
                        </td>
                        <td>
                        </td>
                        <td>
                            <span class="headline txtBlue size14"><a href='poftest.aspx'>Chemistry</a></span>
                        </td>
                        <td>
                            <span class="txtGrey size14"><a href='poftest.aspx'>

                                <a href="personality.aspx?id=26&user_id=41724176" rel="nofollow">View
                                    his
                                    chemistry results</a></a></span>
                        </td>
                    </tr>
                    <tr height="25px">
                        <td>
                            <span class="headline txtBlue size14">Do you drink?</span>
                        </td>
                        <td>
                            <span class="txtGrey size14">
                                Occasionnellement</span>
                        </td>
                        <td>
                        </td>
                        <td>
                            <span class="headline txtBlue size14">Do you want children?</span>
                        </td>
                        <td>
                            <span class="txtGrey size14">
                                Non divulgué</span>
                        </td>
                    </tr>
                    <tr height="25px">
                        <td>
                            <span class="headline txtBlue size14">Marital Status</span>
                        </td>
                        <td>
                            <span class="txtGrey size14">
                                Célibataire</span>
                        </td>
                        <td>
                        </td>
                        <td>
                            <span class="headline txtBlue size14">Do you do drugs?</span>
                        </td>
                        <td>
                            <span class="txtGrey size14">
                                Non</span>
                        </td>
                    </tr>

                    <tr height="25px">
                        <td>
                            <span class="headline txtBlue size14">Pets </span>
                        </td>
                        <td>
                            <span class="txtGrey size14">
                                Aucun</span>
                        </td>
                        <td>
                        </td>
                        <td>
                            <span class="headline txtBlue size14">Eye Color</span>
                        </td>
                        <td>
                            <span class="txtGrey size14">
                                Bruns</span>
                        </td>
                    </tr>

                    <tr height="25px">
                        <td>
                            <span class="headline txtBlue size14">Do you have a car? </span>
                        </td>
                        <td>
                            <span class="txtGrey size14">
                                N/A</span>
                        </td>
                        <td>
                        </td>
                        <td>
                            <span class="headline txtBlue size14">Do you have children?</span>
                        </td>
                        <td>
                            <span class="txtGrey size14">
                                Non</span>
                        </td>
                    </tr>
                    <tr height="25px">
                        <td>
                         <span class="headline txtBlue size14">Longest Relationship</span>
                        </td>

                        <td>
                            <span class="txtGrey size14">
                                Plus de 2 ans</span>
                        </td>
                        <td>
                        </td>
                        <td>

                        </td>
                        <td>

                        </td>
                    </tr>

                </table> 
</body>
</html>

Here are tableUserDetails.content, tableUserDetails and tableUserDetailsList for both tables:

* FIRST TABLE *

print tableUserDetails.content = none

print tableUserDetails =

  <table class="user-details">
                <tr>
                    <td class="headline txtBlue size15" style="width:80px">
                        About
                    </td>
                    <td style="width:10px">
                        &nbsp;
                    </td>
                    <td class="txtGrey size15">
                        Fume occasionnellement with Silhouette mince
                    </td>
                    <td width="25px;">
                        &nbsp;
                    </td>
                    <td class="headline txtBlue size15">
                        City
                    </td>
                    <td class="txtGrey size15">
                        Paris Ile-de-France
                    </td>
                </tr>
                <tr>
                    <td class="headline txtBlue size15">
                        Details
                    </td>
                    <td style="width:10px">
                        &nbsp;
                    </td>
                    <td class="txtGrey size15">
                        26 year old Un homme, 185cm, Sans religion
                    </td>
                    <td>
                    </td>
                    <td class="headline txtBlue size15">
                        Ethnicity
                    </td>
                    <td class="txtGrey size15">
                        Caucasienne Balance with Châtains
                    </td>
                </tr>
                <tr>
                    <td class="headline txtBlue size15">
                        Intent
                    </td>
                    <td style="width:10px">
                        &nbsp;
                    </td>
                    <td class="txtGrey size15">
                        French.Kiss Cherche une relation amoureuse.
                    </td>
                    <td>
                    </td>
                    <td class="headline txtBlue size15" style="width:90px">
                        Education
                    </td>
                    <td class="txtGrey size15">
                        Diplôme universitaire/Licence
                    </td>
                </tr>

                <tr>
                    <td class="headline txtBlue size15">
                        Personnalité
                    </td>
                    <td style="width:10px">
                        &nbsp;
                    </td>
                    <td class="txtGrey size15">

                    </td>   <td>
                    </td>
                <td>
                            <span class="headline txtBlue size15">Profession </span>
                        </td>
                        <td>
                            <span class="txtGrey size15">
                                Visioconférence</span>
                        </td>
                </tr>

            </table> 

print tableUserDetailsList = [u'About', u'Fume occasionnellement with Silhouette mince', u'City', u'Paris Ile-de-France', u'Details', u'26 year old Un homme, 185cm, Sans religion', u'Ethnicity', u'Caucasienne Balance with Ch\xe2tains', u'Intent', u'French.Kiss Cherche une relation amoureuse.', u'Education', u'Dipl\xf4me universitaire/Licence', u'Personnalit\xe9']

* SECOND TABLE *

print tableUserDetails.content = none

print tableUserDetails =

 <table width="85%" class="secondpart">
                    <tr height="25px">
                        <td width="200px">
                            <span class="headline txtBlue size14">I am Seeking a</span>
                        </td>
                        <td width="300px">
                            <span class="txtGrey size14">
                                Une femme</span>
                        </td>
                        <td width="25px">
                        </td>
                        <td width="200px">
                            <span class="headline txtBlue size14">For</span>
                        </td>
                        <td width="200px">
                            <span class="txtGrey size14">
                                Sorties</span>
                        </td>
                    </tr>
                    <tr height="25px">
                        <td>
                            <span class="headline txtBlue size14"><a href='needs_test.aspx'>Needs Test</a></span>
                        </td>
                        <td>
                            <span class="txtGrey size14"><a href='needs_test.aspx'>


                                <a href="needs_view.aspx?id=38028200">View
                                    his
                                    relationship needs</a></a></span>
                        </td>
                        <td>
                        </td>
                        <td>
                            <span class="headline txtBlue size14"><a href='poftest.aspx'>Chemistry</a></span>
                        </td>
                        <td>
                            <span class="txtGrey size14"><a href='poftest.aspx'>

                                <a href="personality.aspx?id=26&user_id=41724176" rel="nofollow">View
                                    his
                                    chemistry results</a></a></span>
                        </td>
                    </tr>
                    <tr height="25px">
                        <td>
                            <span class="headline txtBlue size14">Do you drink?</span>
                        </td>
                        <td>
                            <span class="txtGrey size14">
                                Occasionnellement</span>
                        </td>
                        <td>
                        </td>
                        <td>
                            <span class="headline txtBlue size14">Do you want children?</span>
                        </td>
                        <td>
                            <span class="txtGrey size14">
                                Non divulgué</span>
                        </td>
                    </tr>
                    <tr height="25px">
                        <td>
                            <span class="headline txtBlue size14">Marital Status</span>
                        </td>
                        <td>
                            <span class="txtGrey size14">
                                Célibataire</span>
                        </td>
                        <td>
                        </td>
                        <td>
                            <span class="headline txtBlue size14">Do you do drugs?</span>
                        </td>
                        <td>
                            <span class="txtGrey size14">
                                Non</span>
                        </td>
                    </tr>

                    <tr height="25px">
                        <td>
                            <span class="headline txtBlue size14">Pets </span>
                        </td>
                        <td>
                            <span class="txtGrey size14">
                                Aucun</span>
                        </td>
                        <td>
                        </td>
                        <td>
                            <span class="headline txtBlue size14">Eye Color</span>
                        </td>
                        <td>
                            <span class="txtGrey size14">
                                Bruns</span>
                        </td>
                    </tr>

                    <tr height="25px">
                        <td>
                            <span class="headline txtBlue size14">Do you have a car? </span>
                        </td>
                        <td>
                            <span class="txtGrey size14">
                                N/A</span>
                        </td>
                        <td>
                        </td>
                        <td>
                            <span class="headline txtBlue size14">Do you have children?</span>
                        </td>
                        <td>
                            <span class="txtGrey size14">
                                Non</span>
                        </td>
                    </tr>
                    <tr height="25px">
                        <td>
                         <span class="headline txtBlue size14">Longest Relationship</span>
                        </td>

                        <td>
                            <span class="txtGrey size14">
                                Plus de 2 ans</span>
                        </td>
                        <td>
                        </td>
                        <td>

                        </td>
                        <td>

                        </td>
                    </tr>

                </table> 

print tableUserDetailsList = []

2 answers:

Answer 0 (score: 1)

This works:

i = 0  # reset the counter, as in the question's code
tableUserDetailsList = []
for row in tableUserDetails.findAll('tr'):
    for col in row.findAll('td'):
        contents = list(col.stripped_strings)
        if contents:
            contentTd = contents[0]
            print "TD Number %d : %s" % (i, contentTd)
            tableUserDetailsList.append(contentTd)
            i += 1

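The problem is that the cells in your second table contain spans. The newline before each span is also treated as content and returned in the col.contents list, so col.contents[0] is a whitespace-only string that strips down to an empty string.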
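To make the difference concrete, here is a minimal sketch. It assumes Python 3 and bs4 with the built-in "html.parser" (the question uses Python 2 and the default parser), and the two-cell table is a made-up reduction of the question's HTML, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical reduction of the page: one plain cell, one cell wrapping a <span>.
html = """<table><tr>
<td>
    About
</td>
<td>
    <span class="headline">Pets </span>
</td>
</tr></table>"""

soup = BeautifulSoup(html, "html.parser")
plain_td, span_td = soup.find_all("td")

# Plain cell: the first child is the text node itself.
first_plain = plain_td.contents[0].strip()

# Span cell: the first child is the whitespace before <span>,
# so stripping it leaves an empty string and the cell gets skipped.
first_span = span_td.contents[0].strip()

# stripped_strings walks every descendant string and drops the empty ones.
span_texts = list(span_td.stripped_strings)

print(repr(first_plain), repr(first_span), span_texts)
```

The second cell is the situation the question's loop runs into for every populated cell of the secondpart table.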

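The stripped_strings loop also works for the first table. As Anubhav commented, you should consider iterating over the tables rather than using the same code twice.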
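One possible way to avoid the duplication is sketched below. The helper name first_texts and the shortened HTML are assumptions made for illustration, and the code targets Python 3 with bs4's "html.parser" rather than the Python 2 setup in the question.

```python
from bs4 import BeautifulSoup


def first_texts(table):
    """Return the first non-empty string of every <td> in the given table."""
    texts = []
    for row in table.find_all("tr"):
        for col in row.find_all("td"):
            strings = list(col.stripped_strings)
            if strings:  # skip cells that hold only whitespace or &nbsp;
                texts.append(strings[0])
    return texts


# Shortened stand-in for the page in the question.
html = """
<table class="user-details"><tr><td> About </td><td>&nbsp;</td><td> City </td></tr></table>
<table class="secondpart"><tr><td>
  <span>Pets </span></td><td></td><td>
  <span>
    Aucun</span></td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Same extraction for both tables, keyed by their class attribute.
results = {}
for css_class in ("user-details", "secondpart"):
    table = soup.find("table", {"class": css_class})
    results[css_class] = first_texts(table)

print(results)
```

Keeping the extraction in one function means a fix such as the switch to stripped_strings only has to be made once.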

Answer 1 (score: 0)

Instead of using table = soup.find('table')

use table = soup.find_all('table')

This returns a list of the tables in the HTML, and you can then pick the right table from that list.
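A small sketch of that selection step, assuming Python 3, bs4 with "html.parser", and a trimmed-down version of the page. Note that it only picks the table; the cells still need something like stripped_strings to get past the whitespace before each span.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the page in the question.
html = """
<table class="user-details"><tr><td>About</td></tr></table>
<table width="85%" class="secondpart"><tr><td><span>Pets</span></td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag in document order.
tables = soup.find_all("table")

# Pick by position...
second = tables[1]

# ...or filter on the class attribute, which bs4 exposes as a list of class names.
by_class = [t for t in tables if "secondpart" in t.get("class", [])]

print(len(tables), second["class"], by_class[0] is second)
```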