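Python WebScraping - Trying to find rows in a table

Time: 2018-10-23 16:56:41

Tags: python web-scraping beautifulsoup

I am trying to scrape a table from which I want to extract most of the td data. I can get some information from the rows, but I cannot get the individual tds correctly. What do I need to do to extract the td data? I need the data in the tds whose class is something like standing-table__cell, or I could just get the data from all the tds and sort through it.

Sample output: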

[<tr class="standing-table__row">
<th class="standing-table__cell standing-table__header-cell" data-index="0" data-label="pos" title="Position">#</th>
<th class="standing-table__cell standing-table__header-cell standing-table__cell--name" data-index="1" title="Team">Team</th>
<th class="standing-table__cell standing-table__header-cell" data-index="2" data-label="pld" title="Played">Pl</th>
<th class="standing-table__cell standing-table__header-cell" data-index="9" data-label="pts" data-sort-value="use-attribute">Pts</th>
<th class="standing-table__cell standing-table__header-cell is-hidden--bp15 is-hidden--bp35 " data-index="10" data-sort-value="use-attribute">Last 6</th>
</tr>, <tr class="standing-table__row" data-item-id="345">
<td class="standing-table__cell">1</td>
<td class="standing-table__cell standing-table__cell--name" data-long-name="Manchester City" data-short-name="Manchester City">
<a class="standing-table__cell--name-link" href="/manchester-city">Manchester City</a>
</td>
<td class="standing-table__cell">9</td>
<td class="standing-table__cell is-hidden--bp15 is-hidden--bp35 " data-sort-value="16313333">
<div class="standing-table__form">
<span class="standing-table__form-cell standing-table__form-cell--win" title="Manchester City 2-1 Newcastle United"> </span><span class="standing-table__form-cell standing-table__form-cell--win" title="Manchester City 3-0 Fulham"> </span><span class="standing-table__form-cell standing-table__form-cell--win" title="Cardiff City 0-5 Manchester City"> </span><span class="standing-table__form-cell standing-table__form-cell--win" title="Manchester City 2-0 Brighton and Hove Albion"> </span><span class="standing-table__form-cell standing-table__form-cell--draw" title="Liverpool 0-0 Manchester City"> </span><span class="standing-table__form-cell standing-table__form-cell--win" title="Manchester City 5-0 Burnley"> </span> </div>
</td>
</tr>, <tr class="standing-table__row" data-item-id="155">
<td class="standing-table__cell">2</td>
<td class="standing-table__cell standing-table__cell--name" data-long-name="Liverpool" data-short-name="Liverpool">
  File "C:\Users\scrape.py", line 18, in <module>
    for td in premier_soup_tr.find_all('td', {'class': 'standing-table__cell'}):
  File "C:\Python\Python36\lib\site-packages\bs4\element.py", line 1884, in __getattr__
    "ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
>>> 

My code:

import requests
from bs4 import BeautifulSoup
url = 'https://www.skysports.com/premier-league-table'
premier_r = requests.get(url)
print(premier_r.status_code)
premier_soup = BeautifulSoup(premier_r.text, 'html.parser')
premier_soup_tr = premier_soup.find_all('tr', {'class': 'standing-table__row'})
print(premier_soup_tr)
for td in premier_soup_tr.find_all('td', {'class': 'standing-table__cell'}):
    print(td)

The HTML source looks like:

    <tr class="standing-table__row" data-item-id="345">
  <td class="standing-table__cell">1</td>
  <td class="standing-table__cell standing-table__cell--name" data-short-name="Manchester City" data-long-name="Manchester City">

            <a href="/manchester-city" class="standing-table__cell--name-link">Manchester City</a>

  </td>
  <td class="standing-table__cell">9</td>
  <td class="standing-table__cell">23</td>
  <td class="standing-table__cell is-hidden--bp15 is-hidden--bp35 " data-sort-value="16313333">
          <div class="standing-table__form">
      <span title="Manchester City 2-1 Newcastle United" class="standing-table__form-cell standing-table__form-cell--win"> </span><span title="Manchester City 3-0 Fulham" class="standing-table__form-cell standing-table__form-cell--win"> </span><span title="Cardiff City 0-5 Manchester City" class="standing-table__form-cell standing-table__form-cell--win"> </span><span title="Manchester City 2-0 Brighton and Hove Albion" class="standing-table__form-cell standing-table__form-cell--win"> </span><span title="Liverpool 0-0 Manchester City" class="standing-table__form-cell standing-table__form-cell--draw"> </span><span title="Manchester City 5-0 Burnley" class="standing-table__form-cell standing-table__form-cell--win"> </span>        </div>
        </td>

</tr>
    <tr class="standing-table__row" data-item-id="155">
  <td class="standing-table__cell">2</td>
  <td class="standing-table__cell standing-table__cell--name" data-short-name="Liverpool" data-long-name="Liverpool">

            <a href="/liverpool" class="standing-table__cell--name-link">Liverpool</a>

  </td>

1 answer:

Answer 0 (score: 1)

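You have the right idea, but you need to do something with what you get back. find_all returns a set of results (a ResultSet), so you cannot call premier_soup_tr.find_all on it directly. The correct approach is premier_soup_tr[position].find_all, that is, calling find_all on one row at a time.

This is what I did: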

import requests
from bs4 import BeautifulSoup

url = 'https://www.skysports.com/premier-league-table'
premier_r = requests.get(url)
print(premier_r.status_code)
premier_soup = BeautifulSoup(premier_r.text, 'html.parser')
# find_all returns a ResultSet (a list of <tr> tags), so iterate over it
premier_soup_tr = premier_soup.find_all('tr', {'class': 'standing-table__row'})
# Skip the header row ([1:]) and drop each row's last cell, the "Last 6" form column ([:-1])
result = [
    [td.text.strip() for td in tr.find_all('td', {'class': 'standing-table__cell'})][:-1]
    for tr in premier_soup_tr[1:]
]
print(result)
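
For anyone who wants to try the same idea without hitting the site, here is a minimal sketch that runs against an inline HTML string. The `SAMPLE_HTML` snippet is a trimmed, illustrative copy of the markup quoted in the question, and `extract_rows` is a hypothetical helper name, not part of BeautifulSoup. It loops over the ResultSet and calls find_all on each row, which is the fix for the AttributeError above.

```python
from bs4 import BeautifulSoup

# Trimmed, illustrative copy of the table markup from the question (assumption:
# the real page has more columns and rows than shown here).
SAMPLE_HTML = """
<table>
  <tr class="standing-table__row">
    <th class="standing-table__cell standing-table__header-cell">#</th>
    <th class="standing-table__cell standing-table__header-cell">Team</th>
    <th class="standing-table__cell standing-table__header-cell">Pl</th>
  </tr>
  <tr class="standing-table__row" data-item-id="345">
    <td class="standing-table__cell">1</td>
    <td class="standing-table__cell standing-table__cell--name">
      <a class="standing-table__cell--name-link" href="/manchester-city">Manchester City</a>
    </td>
    <td class="standing-table__cell">9</td>
  </tr>
  <tr class="standing-table__row" data-item-id="155">
    <td class="standing-table__cell">2</td>
    <td class="standing-table__cell standing-table__cell--name">
      <a class="standing-table__cell--name-link" href="/liverpool">Liverpool</a>
    </td>
  </tr>
</table>
"""


def extract_rows(html):
    """Return one list of cell strings for each row that contains <td> cells."""
    soup = BeautifulSoup(html, 'html.parser')
    # find_all returns a ResultSet, which behaves like a list of Tag objects.
    rows = soup.find_all('tr', {'class': 'standing-table__row'})
    table = []
    for row in rows:
        # Each row is a single Tag, so find_all is available on it.
        cells = row.find_all('td', {'class': 'standing-table__cell'})
        if cells:  # the header row only has <th> cells, so it is skipped
            table.append([cell.get_text(strip=True) for cell in cells])
    return table


for row in extract_rows(SAMPLE_HTML):
    print(row)
```

Checking `if cells` skips the header row without relying on its position, which may be a little more robust than slicing with [1:] if the page layout changes.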