BeautifulSoup scraping problem

Time: 2017-01-20 02:19:02

Tags: python python-3.x beautifulsoup

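My code below (almost) manages to split each player's data into rows, with the column values separated by commas. However, the player name seems to have underlying child nodes, and these also show up on separate lines. I only want the text of the name, not the link. In addition, some records are repeated in my output. Any help would be greatly appreciated! I am using BS4 and Python 3.5. Here is my code: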

import urllib
import urllib.request
from bs4 import BeautifulSoup

def make_soup(url):
    page = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(page, "html.parser")
    return soupdata

currentdata = ""
soup = make_soup("http://www.foxsports.com/soccer/stats?competition=1&season=20160&category=STANDARD&pos=0&team=0&isOpp=0&sort=3&sortOrder=0&page=0")
for record in soup.findAll('tr'):
    playerdata = ""
    for data in record.findAll('td'):
        playerdata = playerdata + "," + data.text
        currentdata = currentdata + "\n" + playerdata

        print(currentdata)

1 answer:

Answer 0 (score: 1)

import urllib.request
from bs4 import BeautifulSoup

def make_soup(url):
    # Download the page and parse it with the built-in HTML parser.
    page = urllib.request.urlopen(url)
    return BeautifulSoup(page, "html.parser")

soup = make_soup("http://www.foxsports.com/soccer/stats?competition=1&season=20160&category=STANDARD&pos=0&team=0&isOpp=0&sort=3&sortOrder=0&page=0")
# class_=False keeps only the <tr> tags that have no class attribute.
for record in soup.find_all('tr', class_=False):
    # One string per cell; text nodes inside a cell are joined with commas.
    row = [data.get_text(',', strip=True) for data in record.find_all('td')]
    print(' '.join(row))

Output:

1,Sánchez, Alexis,Sánchez, A.,ARS 21 20 1786 14 7 30 72 3 0
1,Costa, Diego,Costa, D.,CHE 19 19 1681 14 5 26 57 5 0
1,Ibrahimovic, Zlatan,Ibrahimovic, Z.,MUN 20 20 1800 14 3 36 89 5 0
4,Kane, Harry,Kane, H.,TOT 16 16 1360 13 2 27 53 0 0
5,Lukaku, Romelu,Lukaku, R.,EVE 20 19 1737 12 4 28 55 3 0
5,Defoe, Jermain,Defoe, J.,SUN 21 21 1882 12 2 18 57 1 0
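
The first field of each row above still bundles the rank, the full name, the abbreviated name, and the team code, which suggests that the first cell holds several separate text nodes. If only the name is wanted, one option is to iterate over the cell's stripped_strings and pick a fragment by position. The sketch below is only an illustration: the markup, the position 1 for the full name, and the helper name full_name are assumptions made to mirror that field, not facts about the foxsports.com page.

```python
from bs4 import BeautifulSoup

# Hypothetical cell, invented to mirror the first output field above.
CELL = """
<td>
  <span>4</span>
  <a href="#">
    <span>Kane, Harry</span>
    <span>Kane, H.</span>
  </a>
  <span>TOT</span>
</td>
"""

def full_name(cell):
    """Return the second text fragment of the cell (assumed to be the full name)."""
    # stripped_strings yields each text node stripped, skipping blank ones.
    parts = list(cell.stripped_strings)
    return parts[1] if len(parts) > 1 else ""

cell = BeautifulSoup(CELL, "html.parser").td
print(list(cell.stripped_strings))
print(full_name(cell))
```

Picking by position is fragile if the page layout changes, so on the real page it may be safer to select the name element by its tag or class once the actual markup has been inspected.
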
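  1. Collect the data in a list; do not build the rows through string concatenation.
  2. To exclude the tr elements you do not want, use class_=False, which selects only the tr elements that have no class attribute.
  3. get_text() accepts a separator argument.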
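
The three notes above can be tried without network access on a small inline table. This is a minimal sketch: the HTML snippet, the class name "header", the "|" separator, and the helper name extract_rows are made up for illustration and are not taken from the foxsports.com page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example; the real page may differ.
HTML = """
<table>
  <tr class="header"><td>Player</td><td>GP</td></tr>
  <tr><td><a href="#"><span>Kane, Harry</span> <span>TOT</span></a></td><td>16</td></tr>
  <tr><td><a href="#"><span>Costa, Diego</span> <span>CHE</span></a></td><td>19</td></tr>
</table>
"""

def extract_rows(html):
    """Return one list of cell strings per <tr> that has no class attribute."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Note 2: class_=False matches only tags without a class attribute,
    # so the row marked class="header" is skipped.
    for record in soup.find_all("tr", class_=False):
        # Note 3: get_text(separator, strip=True) joins the text nodes of a cell.
        rows.append([td.get_text("|", strip=True) for td in record.find_all("td")])
    # Note 1: the result is a list of lists, not one long concatenated string.
    return rows

rows = extract_rows(HTML)
for row in rows:
    print(" ".join(row))
```

A separator other than a comma is used here because the player names themselves contain commas, which makes the joined cell text easier to split again later.
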