Question

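I want to loop through each row and capture the value of td.text. The problem is that the table has no class, and all the td elements share the same class name. I want to iterate over each row and get the following output:

Row 1) "AMERICANS SOCCER CLUB", "B11EB - AMERICANS-B11EB-WARZALA", "Cameron Coya", "Player 228004", "2016-09-10", "player persistently infringes the laws of the game", "C" (new line)

Row 2) "AVIATORS SOCCER CLUB", "G12DB - AVIATORS-G12DB-REYNGOUDT", "Saskia Reyes", "Player 224463", "2016-09-11", "player/sub guilty of unsporting behavior", "C" (new line)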

<div style="overflow:auto; border:1px #cccccc solid;">
<table cellspacing="0" cellpadding="3" align="left" border="0" width="100%">
    <tbody>
        <tr class="tblHeading">
            <td colspan="7">AMERICANS SOCCER CLUB</td>
        </tr>
        <tr bgcolor="#CCE4F1">
            <td colspan="7">B11EB - AMERICANS-B11EB-WARZALA</td> 
        </tr>
        <tr bgcolor="#FFFFFF">
            <td width="19%" class="tdUnderLine"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Cameron Coya                                       </td>
            <td width="19%" class="tdUnderLine">
                Rozel, Max
            </td>
            <td width="06%" class="tdUnderLine"> 
            09-11-2016
            </td>
            <td width="05%" class="tdUnderLine" align="center">         
                <a href="http://www.ncsanj.com/gameRefReportPrint.cfm?gid=228004" target="_blank">228004</a>    
            </td>
            <td width="16%" class="tdUnderLine" align="center"> 
                09/10/16 02:15 PM   
            </td>
            <td width="30%" class="tdUnderLine">                player persistently infringes the laws of the game   </td>
            <td class="tdUnderLine">                Cautioned    </td>
        </tr>
        <tr class="tblHeading">
            <td colspan="7">AVIATORS SOCCER CLUB</td>
        </tr>
        <tr bgcolor="#CCE4F1">
            <td colspan="7">G12DB - AVIATORS-G12DB-REYNGOUDT</td> 
        </tr>
        <tr bgcolor="#FBFBFB">
            <td width="19%" class="tdUnderLine"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Saskia Reyes                                       </td>
            <td width="19%" class="tdUnderLine">
                HollaenderNardelli, Eric
            </td>
            <td width="06%" class="tdUnderLine"> 
            09-11-2016
            </td>
            <td width="05%" class="tdUnderLine" align="center">         

                <a href="http://www.ncsanj.com/gameRefReportPrint.cfm?gid=224463" target="_blank">224463</a>    
            </td>
            <td width="16%" class="tdUnderLine" align="center"> 
                09/11/16 06:45 PM   
            </td>
            <td width="30%" class="tdUnderLine">                player/sub guilty of unsporting behavior     </td>
            <td class="tdUnderLine">                Cautioned    </td>
        </tr>
        <tr class="tblHeading">
            <td colspan="7">BERGENFIELD SOCCER CLUB</td>
        </tr>
        <tr bgcolor="#CCE4F1">
            <td colspan="7">B11CW - BERGENFIELD-B11CW-NARVAEZ</td> 
        </tr>
        <tr bgcolor="#FFFFFF">
            <td width="19%" class="tdUnderLine"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Christian Latorre                                  </td>
            <td width="19%" class="tdUnderLine">
                Coyle, Kevin
            </td>
            <td width="06%" class="tdUnderLine"> 
            09-10-2016
            </td>
            <td width="05%" class="tdUnderLine" align="center">         

                <a href="http://www.ncsanj.com/gameRefReportPrint.cfm?gid=226294" target="_blank">226294</a>    
            </td>
            <td width="16%" class="tdUnderLine" align="center"> 

                09/10/16 11:00 AM   

            </td>
            <td width="30%" class="tdUnderLine">                player persistently infringes the laws of the game   </td>
            <td class="tdUnderLine">                Cautioned    </td>
        </tr>

I tried the following code.

import requests
from bs4 import BeautifulSoup
import re
try:
    import urllib.request as urllib2
except ImportError:
    import urllib2

url = r"G:\Freelancer\NC Soccer\Northern Counties Soccer Association ©.html"
page = open(url, encoding="utf8")
soup = BeautifulSoup(page.read(),"html.parser")

#tableList = soup.findAll("table")

for tr in soup.find_all("tr"):
    for td in tr.find_all("td"):
        print(td.text.strip())

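But obviously this returns the text of every td, so I cannot tell which column a value belongs to or where a new record starts. I would like to know:

1) how to identify each column (since the class names are all the same) along with the headings (I would appreciate it if you could provide code)

2) how to identify a new record in this kind of structure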

Answer 1

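If the data really is structured like a table, there is a good chance you can read it straight into pandas with pd.read_table(). Note that it accepts a URL in its filepath_or_buffer parameter. For HTML pages, pandas.read_html() is the function aimed at <table> elements. http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html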

Answer 2

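# soup is the BeautifulSoup object built in the question
count = 0
string = ""
for td in soup.find_all("td"):
    string += "\"" + td.text.strip() + "\","
    count += 1
    # one record spans 9 td cells: club, team, and the 7 cells of the data row
    if count % 9 == 0:
        print(string[:-1] + "\n\n")  # string[:-1] removes the trailing ","
        string = ""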

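Since the table is not properly structured, we only need the td elements, rather than going into each row and then into each td of that row, which complicates things. I just used a string; you could instead append the data to a list of lists and process it later. Hope this solves your problem.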
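The list-of-lists idea mentioned above can be sketched as follows. This is only an illustration: it assumes every record really has exactly 9 cells, and the column names in COLUMNS are guesses made from the sample HTML, not names that appear in the page.

```python
# Hypothetical column names for one 9-cell record, guessed from the sample.
COLUMNS = ["club", "team", "player", "referee", "report_date",
           "game_id", "game_time", "offence", "sanction"]


def chunk_cells(cells, size=len(COLUMNS)):
    """Split a flat list of cell texts into consecutive records of `size` cells."""
    # A trailing incomplete record is dropped rather than padded.
    usable = len(cells) - len(cells) % size
    return [cells[i:i + size] for i in range(0, usable, size)]


def label(record):
    """Pair each cell with its (assumed) column name by position."""
    return dict(zip(COLUMNS, record))


# Cell texts as they might come from
# [td.text.strip() for td in soup.find_all("td")]
cells = [
    "AMERICANS SOCCER CLUB", "B11EB - AMERICANS-B11EB-WARZALA", "Cameron Coya",
    "Rozel, Max", "09-11-2016", "228004", "09/10/16 02:15 PM",
    "player persistently infringes the laws of the game", "Cautioned",
    "AVIATORS SOCCER CLUB", "G12DB - AVIATORS-G12DB-REYNGOUDT", "Saskia Reyes",
    "HollaenderNardelli, Eric", "09-11-2016", "224463", "09/11/16 06:45 PM",
    "player/sub guilty of unsporting behavior", "Cautioned",
]

records = chunk_cells(cells)
for record in records:
    row = label(record)
    print(row["club"], "|", row["player"], "|", row["offence"])
```

Because the cells carry no distinguishing class, position within the record is the only handle on a column, which is why the names are attached with zip.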

Answer 3

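There seems to be a pattern: after every 7 tr(s), a new record begins. So you can keep a counter of the rows seen, and when it reaches 7, print a new line and reset it.

ROWS_PER_RECORD = 7  # the figure given in this answer; the excerpt in the question shows 3 tr per record

counter = 0
for tr in soup.find_all("tr"):
    for td in tr.find_all("td"):
        print(td.text.strip())  # place your per-cell code here
    counter = counter + 1
    if counter == ROWS_PER_RECORD:
        print("\n")
        counter = 0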
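To see the row counter at work without the HTML file, here is a small sketch on plain lists. It assumes 3 rows per record, as in the excerpt from the question, and the data rows are shortened for brevity; rows_to_text is a made-up helper name.

```python
def rows_to_text(rows, rows_per_record):
    """Join cell texts row by row, adding a blank line after each complete record."""
    lines = []
    counter = 0
    for cells in rows:
        lines.append(", ".join(cells))
        counter += 1
        if counter == rows_per_record:
            lines.append("")  # blank line marks the end of a record
            counter = 0
    return "\n".join(lines)


# Each inner list stands for the td texts of one tr.
rows = [
    ["AMERICANS SOCCER CLUB"],
    ["B11EB - AMERICANS-B11EB-WARZALA"],
    ["Cameron Coya", "228004", "Cautioned"],  # shortened data row
    ["AVIATORS SOCCER CLUB"],
    ["G12DB - AVIATORS-G12DB-REYNGOUDT"],
    ["Saskia Reyes", "224463", "Cautioned"],  # shortened data row
]

text = rows_to_text(rows, rows_per_record=3)
print(text)
```

A fixed count only holds while every club has exactly one team and one data row; if that is not guaranteed, detecting the heading rows is the safer signal.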

Answer 4

from __future__ import print_function
import re
import datetime
from bs4 import BeautifulSoup

with open("/tmp/a.html") as page:
    soup = BeautifulSoup(page.read(), "html.parser")

# the table has no class, so locate it through the style of its wrapping div
table = soup.find('div', {'style': 'overflow:auto; border:1px #cccccc solid;'}).find('table')

trs = table.find_all('tr')

game = ""
section = ""

for tr in trs:
    if tr.has_attr('class'):
        # heading row (class="tblHeading"): remember the club name
        game = tr.text.strip()
    if tr.has_attr('bgcolor'):
        if tr['bgcolor'] == '#CCE4F1':
            # sub-heading row: remember the team name
            section = tr.text.strip()
        else:
            # data row: clean up the cells and emit one record
            tds = tr.find_all('td')
            # drop non-ASCII characters such as the &nbsp; padding
            extracted_text = [re.sub(r'([^\x00-\x7F])+', '', td.text) for td in tds]
            extracted_text = [x.strip() for x in extracted_text]
            extracted_text = list(filter(lambda x: len(x) > 2, extracted_text))
            extracted_text.pop(1)
            extracted_text[2] = "Player " + extracted_text[2]
            extracted_text[3] = datetime.datetime.strptime(extracted_text[3], '%m/%d/%y %I:%M %p').strftime("%Y-%m-%d")
            extracted_text = ['"' + x + '"' for x in [game, section] + extracted_text]
            print(','.join(extracted_text))

When run:

$ python a.py

"AMERICANS SOCCER CLUB","B11EB - AMERICANS-B11EB-WARZALA","Cameron Coya","Player 228004","2016-09-10","player persistently infringes the laws of the game","C"
"AVIATORS SOCCER CLUB","G12DB - AVIATORS-G12DB-REYNGOUDT","Saskia Reyes","Player 224463","2016-09-11","player/sub guilty of unsporting behavior","C"
"BERGENFIELD SOCCER CLUB","B11CW - BERGENFIELD-B11CW-NARVAEZ","Christian Latorre","Player 226294","2016-09-10","player persistently infringes the laws of the game","C"

Following further conversation with the OP, the input is https://paste.fedoraproject.org/428111/87928814/raw/ and the output of running the script from this answer on it is: https://paste.fedoraproject.org/428110/38792211/raw/
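The idea behind this answer, remembering the last club and team seen and attaching them to every data row, can be tried on plain data without the HTML file. The sketch below is an illustration under assumptions, not the answer's exact code: each row is pre-classified as "club", "team" or "data" (with BeautifulSoup that could be decided from the tr's class and bgcolor attributes), data rows are assumed to have exactly 7 cells, the sanction is cut to its first letter only because the requested output shows "C", and the quoting is left to the standard csv module. The names build_records and to_csv are made up for the example.

```python
import csv
import io
from datetime import datetime


def build_records(rows):
    """Turn (kind, cells) rows into flat records, carrying club and team forward."""
    club = team = ""
    records = []
    for kind, cells in rows:
        if kind == "club":
            club = cells[0]
        elif kind == "team":
            team = cells[0]
        else:  # data row: assumed to hold exactly 7 cells
            player, _referee, _date, game_id, game_time, offence, sanction = cells
            day = datetime.strptime(game_time, "%m/%d/%y %I:%M %p").strftime("%Y-%m-%d")
            # sanction[:1] abbreviates "Cautioned" to "C" (assumption from the requested output)
            records.append([club, team, player, "Player " + game_id,
                            day, offence, sanction[:1]])
    return records


def to_csv(records):
    """Serialise records with every field double-quoted, one record per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(records)
    return buffer.getvalue()


rows = [
    ("club", ["AMERICANS SOCCER CLUB"]),
    ("team", ["B11EB - AMERICANS-B11EB-WARZALA"]),
    ("data", ["Cameron Coya", "Rozel, Max", "09-11-2016", "228004",
              "09/10/16 02:15 PM",
              "player persistently infringes the laws of the game", "Cautioned"]),
    ("club", ["AVIATORS SOCCER CLUB"]),
    ("team", ["G12DB - AVIATORS-G12DB-REYNGOUDT"]),
    ("data", ["Saskia Reyes", "HollaenderNardelli, Eric", "09-11-2016", "224463",
              "09/11/16 06:45 PM",
              "player/sub guilty of unsporting behavior", "Cautioned"]),
]

output = to_csv(build_records(rows))
print(output, end="")
```

Since club and team are only overwritten when a new heading appears, a team with several data rows would produce several records sharing the same first two fields, which a fixed cell or row count cannot handle.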

Parsing a table using BeautifulSoup in Python
