使用python中的beautifulsoup解析表

时间:2016-09-14 04:58:16

标签: python web-scraping beautifulsoup

我想遍历每一行并捕获td.text的值。但是这里的问题是表没有类。并且所有td都有相同的类名。我想遍历每一行,并希望得到以下输出:

第1排)“AMERICANS SOCCER CLUB”,“B11EB - AMERICANS-B11EB-WARZALA”,“Cameron Coya”,“Player 228004”,“2016-09-10”,“玩家持续侵犯游戏规则” ,“C”(新行)

第二排)“AVIATORS SOCCER CLUB”,“G12DB - AVIATORS-G12DB-REYNGOUDT”,“Saskia Reyes”,“Player 224463”,“2016-09-11”,“玩家/子犯有违反体育行为”, “C”(新行)

<div style="overflow:auto; border:1px #cccccc solid;">
<table cellspacing="0" cellpadding="3" align="left" border="0" width="100%">
    <tbody>
        <tr class="tblHeading">
            <td colspan="7">AMERICANS SOCCER CLUB</td>
        </tr>
        <tr bgcolor="#CCE4F1">
            <td colspan="7">B11EB - AMERICANS-B11EB-WARZALA</td> 
        </tr>
        <tr bgcolor="#FFFFFF">
            <td width="19%" class="tdUnderLine"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Cameron Coya                                       </td>
            <td width="19%" class="tdUnderLine">
                Rozel, Max
            </td>
            <td width="06%" class="tdUnderLine"> 
            09-11-2016
            </td>
            <td width="05%" class="tdUnderLine" align="center">         
                <a href="http://www.ncsanj.com/gameRefReportPrint.cfm?gid=228004" target="_blank">228004</a>    
            </td>
            <td width="16%" class="tdUnderLine" align="center"> 
                09/10/16 02:15 PM   
            </td>
            <td width="30%" class="tdUnderLine">                player persistently infringes the laws of the game   </td>
            <td class="tdUnderLine">                Cautioned    </td>
        </tr>
        <tr class="tblHeading">
            <td colspan="7">AVIATORS SOCCER CLUB</td>
        </tr>
        <tr bgcolor="#CCE4F1">
            <td colspan="7">G12DB - AVIATORS-G12DB-REYNGOUDT</td> 
        </tr>
        <tr bgcolor="#FBFBFB">
            <td width="19%" class="tdUnderLine"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Saskia Reyes                                       </td>
            <td width="19%" class="tdUnderLine">
                HollaenderNardelli, Eric
            </td>
            <td width="06%" class="tdUnderLine"> 
            09-11-2016
            </td>
            <td width="05%" class="tdUnderLine" align="center">         

                <a href="http://www.ncsanj.com/gameRefReportPrint.cfm?gid=224463" target="_blank">224463</a>    
            </td>
            <td width="16%" class="tdUnderLine" align="center"> 
                09/11/16 06:45 PM   
            </td>
            <td width="30%" class="tdUnderLine">                player/sub guilty of unsporting behavior     </td>
            <td class="tdUnderLine">                Cautioned    </td>
        </tr>
        <tr class="tblHeading">
            <td colspan="7">BERGENFIELD SOCCER CLUB</td>
        </tr>
        <tr bgcolor="#CCE4F1">
            <td colspan="7">B11CW - BERGENFIELD-B11CW-NARVAEZ</td> 
        </tr>
        <tr bgcolor="#FFFFFF">
            <td width="19%" class="tdUnderLine"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Christian Latorre                                  </td>
            <td width="19%" class="tdUnderLine">
                Coyle, Kevin
            </td>
            <td width="06%" class="tdUnderLine"> 
            09-10-2016
            </td>
            <td width="05%" class="tdUnderLine" align="center">         

                <a href="http://www.ncsanj.com/gameRefReportPrint.cfm?gid=226294" target="_blank">226294</a>    
            </td>
            <td width="16%" class="tdUnderLine" align="center"> 

                09/10/16 11:00 AM   

            </td>
            <td width="30%" class="tdUnderLine">                player persistently infringes the laws of the game   </td>
            <td class="tdUnderLine">                Cautioned    </td>
        </tr>

我尝试使用以下代码。

import requests
from bs4 import BeautifulSoup
import re
try:
    import urllib.request as urllib2
except ImportError:
    import urllib2

url = r"G:\Freelancer\NC Soccer\Northern Counties Soccer Association ©.html"
page = open(url, encoding="utf8")
soup = BeautifulSoup(page.read(),"html.parser")

#tableList = soup.findAll("table")

for tr in soup.find_all("tr"):
    for td in tr.find_all("td"):
        print(td.text.strip())

但很明显,它将返回所有td的文本形式,我将无法识别特定的列名称或无法确定新记录的开始。我想知道

1)如何识别每一列(因为类名相同)并且还有标题(如果您提供代码,我将不胜感激)

2)如何识别这种结构中的新记录

4 个答案:

答案 0 :(得分:1)

如果数据的结构非常像表格,那么您很有可能直接使用pd.read_table()将其读入pandas。请注意,它接受filepath_or_buffer参数中的url。 http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html

答案 1 :(得分:1)

count = 0
string = ""
for td in soup.find_all("td"):
string += "\""+td.text.strip()+"\","
count +=1
if(count % 9 ==0):
    print string[:-1] + "\n\n" # string[:-1] to remove the last ","
    string = ""

由于表格不是正确的格式,我们只需要使用td,而不是进入每一行,然后进入每行中的td,这使得工作变得复杂。我只是使用了一个字符串,您可以将数据附加到列表列表中并进行处理以供以后使用 希望这能解决您的问题

答案 2 :(得分:0)

似乎有一种模式。在每7个tr(s)之后,有一个新的线。 所以,你可以做的是保持一个从1开始的计数器,当它触及7时,添加一个新行并重新启动它。

counter = 1
for tr in find_all("tr"):
    for td in tr.find_all("td"):
        # place code
    counter = counter + 1
    if counter == 7:
        print "\n"
        counter = 1

答案 3 :(得分:0)

from __future__ import print_function
import re
import datetime
from bs4 import BeautifulSoup

soup = ""
with open("/tmp/a.html") as page:
   soup = BeautifulSoup(page.read(),"html.parser")

table = soup.find('div', {'style': 'overflow:auto; border:1px #cccccc solid;'}).find('table')

trs = table.find_all('tr')

table_dict = {}
game = ""
section = ""

for tr in trs:
    if tr.has_attr('class'):
        game = tr.text.strip('\n')
    if tr.has_attr('bgcolor'):
        if tr['bgcolor'] == '#CCE4F1':
            section = tr.text.strip('\n')
        else:
            tds = tr.find_all('td')
            extracted_text = [re.sub(r'([^\x00-\x7F])+','', td.text) for td in tds]
            extracted_text = [x.strip() for x in extracted_text]
            extracted_text = list(filter(lambda x: len(x) > 2, extracted_text))
            extracted_text.pop(1)
            extracted_text[2] = "Player " + extracted_text[2]
            extracted_text[3] = datetime.datetime.strptime(extracted_text[3], '%m/%d/%y %I:%M %p').strftime("%Y-%m-%d")
            extracted_text = ['"' + x + '"' for x in [game, section] + extracted_text]
            print(','.join(extracted_text))

跑步时:

$ python a.py

"AMERICANS SOCCER CLUB","B11EB - AMERICANS-B11EB-WARZALA","Cameron Coya","Player 228004","2016-09-10","player persistently infringes the laws of the game","C"
"AVIATORS SOCCER CLUB","G12DB - AVIATORS-G12DB-REYNGOUDT","Saskia Reyes","Player 224463","2016-09-11","player/sub guilty of unsporting behavior","C"
"BERGENFIELD SOCCER CLUB","B11CW - BERGENFIELD-B11CW-NARVAEZ","Christian Latorre","Player 226294","2016-09-10","player persistently infringes the laws of the game","C"

根据与OP的进一步对话,输入为https://paste.fedoraproject.org/428111/87928814/raw/,运行上述代码后的输出为:https://paste.fedoraproject.org/428110/38792211/raw/