Question

I want to use BeautifulSoup to extract specific rows from this link:

http://stats.espncricinfo.com/ci/engine/player/37000.html?class=2;template=results;type=batting

I only want the rows that start with year 20XX (where XX stands for any year).

The data looks like this:

year 1994       18  17  1   348 90  21.75   491 70.87   0   2   1   25  2
year 1995       16  16  2   514 78* 36.71   637 80.69   0   4   1   44  3
year 1996       21  21  2   708 106* 37.26   957 73.98   1   5   0   71  1
.
.
2007

Can anyone help?

Answer 1

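Here is what I came up with. The site you are scraping has several table bodies, each with several rows, and they all share the same class, which can make it hard to isolate the one you want. For example, every row on that page is a <tr> tag, which marks a table row in HTML, and each of those <tr> tags has the same class "data1", like this: <tr class="data1">...</tr>. All you have to do is check whether the current row contains the word "year". A simple if statement does that: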

import requests
from bs4 import BeautifulSoup
# The "lxml" parser must be installed, but it does not need to be imported here.

link = "http://stats.espncricinfo.com/ci/engine/player/37000.html?class=2;template=results;type=batting"

result = requests.get(link)
source = result.content
soup = BeautifulSoup(source, "lxml")

for i in soup.find_all("tr", class_="data1"):
    text = i.text
    # checking if the row contains the word "year"
    if "year" in text:
        # do stuff with text
        print(text)
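
The check above keeps every row that contains "year", while the question asks only for rows starting with year 20XX. A minimal sketch of a stricter test, using a regular expression on the row text, could look like the following. The helper name starts_with_year_20xx and the sample strings are hypothetical; the strings assume that i.text has the newline-separated shape shown later in this answer, and the 2007 row carries no cell values because the question does not list them.

```python
import re

# Matches row text whose first cell is "year 20XX" (XX = any two digits).
YEAR_20XX = re.compile(r"^\s*year\s+20\d{2}\b")

def starts_with_year_20xx(row_text):
    """Return True if the row text begins with 'year 20XX'."""
    return YEAR_20XX.match(row_text) is not None

# Hypothetical sample texts, shaped like i.text in the loop above.
sample_rows = [
    "\nyear 1994\n\n18\n17\n1\n348\n",
    "\nyear 1996\n\n21\n21\n2\n708\n",
    "\nyear 2007\n",
]

# Keep only the label of each row that passes the stricter test.
kept = [t.strip().split("\n")[0] for t in sample_rows if starts_with_year_20xx(t)]
print(kept)
```

In the loop, the condition if "year" in text: would then become if starts_with_year_20xx(text):.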

Edit [in response to a comment]

Replace the previous if statement with the following:

    if "year" in text:
        row = text.strip().split("\n")
        if '' in row: row.remove('')
        runs = row[4]
        print(runs)

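First, I break the row into its individual elements. .strip() removes the leading and trailing whitespace, and .split("\n") splits the text on each \n. That stores every value neatly in a list, like this: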

['year 1994', '', '18', '17', '1', '348', '90', '21.75', '491', '70.87', '0', '2', '1', '25', '2']

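However, you can see that there is an empty string ('') at index 1 of the list. To remove it, I use a simple if statement that deletes the empty string if it is present: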

if '' in row: row.remove('')

This gives you a clean list containing all of the elements for each year:

['year 1994', '18', '17', '1', '348', '90', '21.75', '491', '70.87', '0', '2', '1', '25', '2']
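
Note that list.remove('') deletes only the first empty string. If a row ever contained more than one, a list comprehension would drop them all. A small sketch, assuming the row text has the newline-separated shape implied by the lists above (raw_text is a hypothetical stand-in for i.text):

```python
# Hypothetical stand-in for i.text, reconstructed from the 1994 list above.
raw_text = "\nyear 1994\n\n18\n17\n1\n348\n90\n21.75\n491\n70.87\n0\n2\n1\n25\n2\n"

# Split on newlines and keep only the non-empty cells.
row = [cell for cell in raw_text.strip().split("\n") if cell != ""]
print(row)
```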

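Each value in the list is now one cell from that row on the site. Going back to the site, we can see that the number of runs for year 1994 is 348. That is index 4 of the list (the fifth element). We can use this to get only the runs for each year: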

runs = row[4]
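
If you want to keep the values rather than just print them, one option is to collect the runs into a dictionary keyed by year. The sketch below is only an illustration: the rows are the first five cells of the three sample rows from the question, and the name runs_by_year is hypothetical.

```python
# First five cells of the three sample rows from the question.
rows = [
    ['year 1994', '18', '17', '1', '348'],
    ['year 1995', '16', '16', '2', '514'],
    ['year 1996', '21', '21', '2', '708'],
]

# "year 1994" -> 1994, and the runs cell (index 4) -> int.
runs_by_year = {int(row[0].split()[1]): int(row[4]) for row in rows}
print(runs_by_year)
```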
