Question

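I am trying to extract data from this table on ESPN Cricinfo (the website being scraped).

Each row has the following format (the data has been replaced by the column titles):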

<tr class="data1"> <td class="left" nowrap="nowrap"><a>Player Name</a> (Country)</td> <td>Score</td> <td>Minutes Played</td> <td nowrap="nowrap">Balls Faced</td> <td etc... </tr>

I used the following code in a Python script to capture the values in the table:

bats    = content.xpath('//tr[@class="data1"]/td[1]/a')
cntry   = content.xpath('//tr[@class="data1"]/td[1]/*')
run     = content.xpath('//tr[@class="data1"]/td[2]')
mins    = content.xpath('//tr[@class="data1"]/td[3]')
bf      = content.xpath('//tr[@class="data1"]/td[4]')

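The data is then written to a CSV file for storage.

All of the data is captured successfully except the player's country. The player's name and country are stored inside the same <td> tag; however, the player's name is also inside an <a> tag, so it is easy to capture. My problem is that the value captured for the player's country (the cntry variable above) is the player's name. I am sure the code is incorrect, but I am not sure why.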

Answer 1

Where you have:

cntry = content.xpath('//tr[@class="data1"]/td[1]/*')

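The '*' matches only child elements (tags) and skips any text nodes, so it returns the <a> element, which holds the player's name.

You can replace that line with this one to get the text instead of the tag: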

cntry = content.xpath('//tr[@class="data1"]/td[1]/text()')

See if that works for you.

Edit

To remove the whitespace at the start of each item, do the following:
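To see the difference between the two expressions, here is a minimal sketch run against a made-up row in the format from the question. The SAMPLE markup and the variable names element_tags, element_texts and text_nodes are assumptions for illustration, not taken from the real page.

```python
from lxml import html

# Hypothetical sample row in the format shown in the question.
SAMPLE = """
<table>
  <tr class="data1">
    <td class="left" nowrap="nowrap"><a>Player Name</a> (Country)</td>
    <td>Score</td>
    <td>Minutes Played</td>
    <td nowrap="nowrap">Balls Faced</td>
  </tr>
</table>
"""

content = html.fromstring(SAMPLE)

# '*' selects element children only, so this returns the <a> element.
elements = content.xpath('//tr[@class="data1"]/td[1]/*')
element_tags = [el.tag for el in elements]
element_texts = [el.text for el in elements]

# text() selects the text nodes that sit directly inside the <td>,
# which is where the country lives.
text_nodes = content.xpath('//tr[@class="data1"]/td[1]/text()')

print(element_tags)
print(element_texts)
print(text_nodes)
```

The first two lists describe the <a> element and its player name, while the last one holds the text that follows the link inside the same cell.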

cntry = content.xpath('//tr[@class="data1"]/td[1]/text()')
cntry = [str(x).strip() for x in cntry]
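Putting the pieces together, the following is a rough sketch of one way to clean the country value and write rows to CSV, as the question describes. It is not the asker's actual script: the SAMPLE rows, the column names, the choice to drop the surrounding parentheses, and the use of an in-memory buffer instead of a file are all assumptions. It also walks the table one row at a time, which is a design choice of this sketch, so that the columns stay aligned even if a cell is empty.

```python
import csv
import io

from lxml import html

# Hypothetical sample rows in the format shown in the question.
SAMPLE = """
<table>
  <tr class="data1">
    <td class="left" nowrap="nowrap"><a>Player A</a> (Country A)</td>
    <td>10</td>
  </tr>
  <tr class="data1">
    <td class="left" nowrap="nowrap"><a>Player B</a> (Country B)</td>
    <td>25</td>
  </tr>
</table>
"""

content = html.fromstring(SAMPLE)

rows = []
for tr in content.xpath('//tr[@class="data1"]'):
    # string() returns the text content of the first matching node.
    name = tr.xpath('string(td[1]/a)').strip()
    # Join the direct text nodes of the cell, then drop the
    # surrounding whitespace and parentheses.
    country = ''.join(tr.xpath('td[1]/text()')).strip().strip('()')
    runs = tr.xpath('string(td[2])').strip()
    rows.append((name, country, runs))

# Write to an in-memory buffer here; a real script would open a file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['player', 'country', 'runs'])
writer.writerows(rows)

print(buffer.getvalue())
```

To write to disk instead, the buffer could be replaced with a file opened with newline='' as the csv module documentation recommends.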
