我想遍历每一行并捕获td.text的值。但是这里的问题是表没有类。并且所有td都有相同的类名。我想遍历每一行,并希望得到以下输出:
第1排)“AMERICANS SOCCER CLUB”,“B11EB - AMERICANS-B11EB-WARZALA”,“Cameron Coya”,“Player 228004”,“2016-09-10”,“玩家持续侵犯游戏规则” ,“C”(新行)
第二排)“AVIATORS SOCCER CLUB”,“G12DB - AVIATORS-G12DB-REYNGOUDT”,“Saskia Reyes”,“Player 224463”,“2016-09-11”,“玩家/子犯有违反体育行为”, “C”(新行)
<div style="overflow:auto; border:1px #cccccc solid;">
<table cellspacing="0" cellpadding="3" align="left" border="0" width="100%">
<tbody>
<tr class="tblHeading">
<td colspan="7">AMERICANS SOCCER CLUB</td>
</tr>
<tr bgcolor="#CCE4F1">
<td colspan="7">B11EB - AMERICANS-B11EB-WARZALA</td>
</tr>
<tr bgcolor="#FFFFFF">
<td width="19%" class="tdUnderLine"> Cameron Coya </td>
<td width="19%" class="tdUnderLine">
Rozel, Max
</td>
<td width="06%" class="tdUnderLine">
09-11-2016
</td>
<td width="05%" class="tdUnderLine" align="center">
<a href="http://www.ncsanj.com/gameRefReportPrint.cfm?gid=228004" target="_blank">228004</a>
</td>
<td width="16%" class="tdUnderLine" align="center">
09/10/16 02:15 PM
</td>
<td width="30%" class="tdUnderLine"> player persistently infringes the laws of the game </td>
<td class="tdUnderLine"> Cautioned </td>
</tr>
<tr class="tblHeading">
<td colspan="7">AVIATORS SOCCER CLUB</td>
</tr>
<tr bgcolor="#CCE4F1">
<td colspan="7">G12DB - AVIATORS-G12DB-REYNGOUDT</td>
</tr>
<tr bgcolor="#FBFBFB">
<td width="19%" class="tdUnderLine"> Saskia Reyes </td>
<td width="19%" class="tdUnderLine">
HollaenderNardelli, Eric
</td>
<td width="06%" class="tdUnderLine">
09-11-2016
</td>
<td width="05%" class="tdUnderLine" align="center">
<a href="http://www.ncsanj.com/gameRefReportPrint.cfm?gid=224463" target="_blank">224463</a>
</td>
<td width="16%" class="tdUnderLine" align="center">
09/11/16 06:45 PM
</td>
<td width="30%" class="tdUnderLine"> player/sub guilty of unsporting behavior </td>
<td class="tdUnderLine"> Cautioned </td>
</tr>
<tr class="tblHeading">
<td colspan="7">BERGENFIELD SOCCER CLUB</td>
</tr>
<tr bgcolor="#CCE4F1">
<td colspan="7">B11CW - BERGENFIELD-B11CW-NARVAEZ</td>
</tr>
<tr bgcolor="#FFFFFF">
<td width="19%" class="tdUnderLine"> Christian Latorre </td>
<td width="19%" class="tdUnderLine">
Coyle, Kevin
</td>
<td width="06%" class="tdUnderLine">
09-10-2016
</td>
<td width="05%" class="tdUnderLine" align="center">
<a href="http://www.ncsanj.com/gameRefReportPrint.cfm?gid=226294" target="_blank">226294</a>
</td>
<td width="16%" class="tdUnderLine" align="center">
09/10/16 11:00 AM
</td>
<td width="30%" class="tdUnderLine"> player persistently infringes the laws of the game </td>
<td class="tdUnderLine"> Cautioned </td>
</tr>
我尝试使用以下代码。
import requests
from bs4 import BeautifulSoup
import re
try:
import urllib.request as urllib2
except ImportError:
import urllib2
url = r"G:\Freelancer\NC Soccer\Northern Counties Soccer Association ©.html"
page = open(url, encoding="utf8")
soup = BeautifulSoup(page.read(),"html.parser")
#tableList = soup.findAll("table")
for tr in soup.find_all("tr"):
for td in tr.find_all("td"):
print(td.text.strip())
但很明显,它将返回所有td的文本形式,我将无法识别特定的列名称或无法确定新记录的开始。我想知道
1)如何识别每一列(因为类名相同)并且还有标题(如果您提供代码,我将不胜感激)
2)如何识别这种结构中的新记录
答案 0 :(得分:1)
如果数据的结构非常像表格,那么您很有可能直接使用pd.read_table()将其读入pandas。请注意,它接受filepath_or_buffer参数中的url。 http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html
答案 1 :(得分:1)
count = 0
string = ""
for td in soup.find_all("td"):
string += "\""+td.text.strip()+"\","
count +=1
if(count % 9 ==0):
print string[:-1] + "\n\n" # string[:-1] to remove the last ","
string = ""
由于表格不是正确的格式,我们只需要使用td,而不是进入每一行,然后进入每行中的td,这使得工作变得复杂。我只是使用了一个字符串,您可以将数据附加到列表列表中并进行处理以供以后使用 希望这能解决您的问题
答案 2 :(得分:0)
似乎有一种模式。在每7个tr(s)之后,有一个新的线。 所以,你可以做的是保持一个从1开始的计数器,当它触及7时,添加一个新行并重新启动它。
counter = 1
for tr in find_all("tr"):
for td in tr.find_all("td"):
# place code
counter = counter + 1
if counter == 7:
print "\n"
counter = 1
答案 3 :(得分:0)
from __future__ import print_function
import re
import datetime
from bs4 import BeautifulSoup
soup = ""
with open("/tmp/a.html") as page:
soup = BeautifulSoup(page.read(),"html.parser")
table = soup.find('div', {'style': 'overflow:auto; border:1px #cccccc solid;'}).find('table')
trs = table.find_all('tr')
table_dict = {}
game = ""
section = ""
for tr in trs:
if tr.has_attr('class'):
game = tr.text.strip('\n')
if tr.has_attr('bgcolor'):
if tr['bgcolor'] == '#CCE4F1':
section = tr.text.strip('\n')
else:
tds = tr.find_all('td')
extracted_text = [re.sub(r'([^\x00-\x7F])+','', td.text) for td in tds]
extracted_text = [x.strip() for x in extracted_text]
extracted_text = list(filter(lambda x: len(x) > 2, extracted_text))
extracted_text.pop(1)
extracted_text[2] = "Player " + extracted_text[2]
extracted_text[3] = datetime.datetime.strptime(extracted_text[3], '%m/%d/%y %I:%M %p').strftime("%Y-%m-%d")
extracted_text = ['"' + x + '"' for x in [game, section] + extracted_text]
print(','.join(extracted_text))
跑步时:
$ python a.py
"AMERICANS SOCCER CLUB","B11EB - AMERICANS-B11EB-WARZALA","Cameron Coya","Player 228004","2016-09-10","player persistently infringes the laws of the game","C"
"AVIATORS SOCCER CLUB","G12DB - AVIATORS-G12DB-REYNGOUDT","Saskia Reyes","Player 224463","2016-09-11","player/sub guilty of unsporting behavior","C"
"BERGENFIELD SOCCER CLUB","B11CW - BERGENFIELD-B11CW-NARVAEZ","Christian Latorre","Player 226294","2016-09-10","player persistently infringes the laws of the game","C"
根据与OP的进一步对话,输入为https://paste.fedoraproject.org/428111/87928814/raw/,运行上述代码后的输出为:https://paste.fedoraproject.org/428110/38792211/raw/