Question

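This is my first attempt at coding, so please forgive my ignorance. I am trying to practice web scraping with the following link: https://data.gov.sg/dataset/industrial-arbitration-court-awards-by-nature-of-trade-disputes?view_id=d3e444ef-54ed-4d0b-b715-1ee465f6d882&resource_id=c24d0d00-2d12-4f68-8fc9-4121433332e0

I have honestly spent hours trying to figure out what is wrong with my code: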

import csv
import requests
from BeautifulSoup import BeautifulSoup

url = 'https://data.gov.sg/dataset/industrial-arbitration-court-awards-by-nature-of-trade-disputes?view_id=d3e444ef-54ed-4d0b-b715-1ee465f6d882&resource_id=c24d0d00-2d12-4f68-8fc9-4121433332e0'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html)
table = soup.find('tbody')

list_of_rows = []
for row in table.find('tr'):
    list_of_cells = []
    for cell in row.findAll('td'):
        list_of_cells.append()
    list_of_rows.append(list_of_cells)

outfile = open("./indarb.csv","wb")
writer = csv.writer(outfile)

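My terminal then spits out this: 'NoneType' object has no attribute 'find', saying there is an error on line 13. I'm not sure whether it helps, but here is a list of what I have tried:

Different permutations of 'find'/'findAll'

Using '.findAll'
Using '.find'

Different permutations for line 10

Tried soup.find('tbody')
Tried soup.find('table')
Opened the page source and tried soup.find('table', attrs={'class': 'table table-condensed'})

Different permutations for line 13

Likewise tried using the 'tr' tag; or
Tried adding the 'attrs={}' stuff

I have really tried but cannot figure out why I can't scrape that simple 10-row table. If anyone could post code that works, that would be amazing. Thank you for your patience!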

Answer 1

The URL you are requesting in your code returns JSON, not HTML.
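One way to check this before parsing is to look at the response's Content-Type header and only hand the body to BeautifulSoup when it is HTML. The sketch below is a minimal illustration, not a tested fix for this site: the helper name classify_body and the sample payload are made up for the example, and what this URL actually returns is taken from the answer above rather than verified.

```python
import json


def classify_body(content_type, body):
    """Return ('json', parsed) or ('html', body) based on the Content-Type header.

    Hypothetical helper for illustration; not part of requests or bs4.
    """
    # Header values look like "application/json; charset=utf-8",
    # so keep only the media type before the first ';'.
    media_type = content_type.split(';')[0].strip().lower()
    if media_type == 'application/json':
        return 'json', json.loads(body)
    return 'html', body


# Made-up payload standing in for a real response body.
sample_kind, sample_data = classify_body(
    'application/json; charset=utf-8',
    '{"records": [{"year": "2010", "count": "3"}]}',
)
print(sample_kind, sample_data['records'][0]['year'])
```

With requests, the header is available as response.headers.get('Content-Type', '') and the body as response.text; response.json() is the shortcut when the body is known to be JSON.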

Answer 2

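You have a few errors. The biggest is that you are using BeautifulSoup 3, which has not been developed for years; you should use bs4. You also need find_all when you want multiple tags. In addition, you never pass the cell to list_of_cells.append() on line 16, which is what would cause the other error: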

import requests
from bs4 import BeautifulSoup

url = 'https://data.gov.sg/dataset/industrial-arbitration-court-awards-by-nature-of-trade-disputes?view_id=d3e444ef-54ed-4d0b-b715-1ee465f6d882&resource_id=c24d0d00-2d12-4f68-8fc9-4121433332e0'
response = requests.get(url)
html = response.content

# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(html, 'html.parser')
# find() returns the first matching tag, or None if there is no match.
table = soup.find('table')

list_of_rows = []
# find_all() returns every matching tag; find() returns only the first.
for row in table.find_all('tr'):
    list_of_cells = []
    for cell in row.find_all('td'):
        list_of_cells.append(cell)
    list_of_rows.append(list_of_cells)

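I'm not sure exactly what you want, but this appends the tds from the first table on the page. If you actually need the data, there is also an API and a downloadable CSV.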
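Neither the question's code nor the snippet above actually writes the rows out. A rough sketch of that last step follows, assuming Python 3 and assuming each cell has already been reduced to a string (with bs4, cell.get_text(strip=True) does that). The function name write_rows and the sample rows are invented for the example. Note that in Python 3 the csv module expects a file opened in text mode with newline='', not "wb".

```python
import csv
import io


def write_rows(outfile, rows):
    """Write a list of rows (each a list of strings) to an open text file."""
    writer = csv.writer(outfile)
    writer.writerows(rows)


# Made-up rows standing in for the scraped table.
sample_rows = [
    ['Year', 'Nature of dispute', 'Awards'],
    ['2010', 'Wages, allowances', '3'],
]

# In-memory buffer so the sketch runs without touching the disk.
buffer = io.StringIO(newline='')
write_rows(buffer, sample_rows)
print(buffer.getvalue())
```

To write a real file, something like with open('indarb.csv', 'w', newline='') as f: write_rows(f, rows) should work; the csv writer takes care of quoting cells that contain commas.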

Web scraping with Python: NoneType error, unable to scrape the table's data
