Question

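I'm having trouble scraping a web page. I'm trying to learn how to do it, but I can't seem to get past some of the basics. The error I get is "TypeError: 'ResultSet' object is not callable".

I've tried a number of different approaches. I originally used "find" instead of "find_all", but I ran into a problem with BeautifulSoup returning a NoneType. I couldn't write an if statement that got around that exception, so I tried "find_all" instead.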

page = requests.get('https://topworkplaces.com/publication/ocregister/')

soup = BeautifulSoup(page.text,'html.parser')
all_company_list = soup.find_all(class_='sortable-table')
#all_company_list = soup.find(class_='sortable-table')


company_name_list_items = all_company_list('td')

for company_name in company_name_list_items:
    #print(company_name.prettify())
    companies = company_name.content[0]

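I'd like this to pull in a clean list of all the companies in Orange County, California. As you can see, I've already managed to pull them in, but I want the list to be clean.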

Answer 1

pandas:

pandas is often useful here. The page offers several sort orders, including company size and rank. I show sorting by rank.

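import pandas as pd

# read_html returns a list of DataFrames, one per <table>; take the first
table = pd.read_html('https://topworkplaces.com/publication/ocregister/')[0]
table.columns = table.iloc[0]  # promote the first row to column headers
table = table[1:].copy()       # drop that header row; copy so the assignment below is safe
table['Rank'] = pd.to_numeric(table['Rank'])  # convert ranks so they sort numerically
rank_sort_table = table.sort_values(by='Rank', axis=0, ascending=True)
rank_sort_table.reset_index(inplace=True, drop=True)
rank_sort_table.columns.names = ['Index']
print(rank_sort_table)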

With that sort applied, the companies come out in this order:

print(rank_sort_table.Company)
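
To illustrate why the pd.to_numeric step matters before sorting, here is a minimal sketch on a made-up three-row table. The company names and ranks are hypothetical, and it assumes the ranks come out of the HTML table as strings:

```python
import pandas as pd

# Hypothetical data: ranks stored as strings, as scraped text usually is.
table = pd.DataFrame({
    "Company": ["A Corp", "B Corp", "C Corp"],
    "Rank": ["10", "2", "1"],
})

# Sorting the raw strings compares them character by character ("10" < "2").
as_text = table.sort_values(by="Rank")["Company"].tolist()

# Converting to numbers first gives the intended order (1, 2, 10).
numeric = table.assign(Rank=pd.to_numeric(table["Rank"]))
as_number = numeric.sort_values(by="Rank")["Company"].tolist()

print(as_text)    # ['C Corp', 'A Corp', 'B Corp']
print(as_number)  # ['C Corp', 'B Corp', 'A Corp']
```

Without the conversion, rank 10 would sort ahead of rank 2.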

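requests:

Incidentally, you can use nth-of-type to select only the first column (the company names), and use the table's id rather than its class name to identify the table faster.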

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://topworkplaces.com/publication/ocregister/')
soup = bs(r.content, 'lxml')
names = [item.text for item in soup.select('#twpRegionalList td:nth-of-type(1)')]
print(names)

Note that the default sort is alphabetical on the name column, not by rank.
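
As for the error in the question itself: find_all returns a ResultSet, which behaves like a list of tags, so it cannot be called the way a single Tag can (calling a Tag is shorthand for find_all on it). Here is a rough sketch of the difference, plus a None check for find, using a tiny inline HTML snippet that is only a stand-in for the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical one-row table standing in for the live page.
html = '<table class="sortable-table"><tr><td>Example Co</td><td>1</td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list-like ResultSet; calling it raises the TypeError.
tables = soup.find_all(class_="sortable-table")
try:
    tables("td")
except TypeError as exc:
    error_message = str(exc)

# find returns a single Tag, or None when nothing matches, so check first.
table = soup.find(class_="sortable-table")
cells = table.find_all("td") if table is not None else []
names = [cell.get_text(strip=True) for cell in cells]

# A class that does not exist in the document yields None rather than a Tag.
missing = soup.find(class_="no-such-class")

print(error_message)
print(names)
print(missing)
```

Either index into the ResultSet (for example tables[0]) or use find with a None check before calling find_all on the result.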

Reference:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html

Answer 2

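You have the right idea. Finding all the <td> tags straight away returns one <td> for every row (140 rows) and every column in that row (4 columns). If you only want the company names, I think it may be easier to find all the rows (<tr> tags) and then add as many columns as you need by iterating over the <td> tags in each row. This gets the first column, the company names: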

import requests
from bs4 import BeautifulSoup

page = requests.get('https://topworkplaces.com/publication/ocregister/')

soup = BeautifulSoup(page.text,'html.parser')
all_company_list = soup.find_all('tr')

company_list = [c.find('td').text for c in all_company_list[1::]]

Now company_list contains all 140 company names:

>>> print(len(company_list))
140
>>> print(company_list)
['Advanced Behavioral Health', 'Advanced Management Company & R³ Construction Services, Inc.',
...
, 'Wes-Tec, Inc', 'Western Resources Title Company', 'Wunderman', 'Ytel, Inc.', 'Zillow Group']

Change c.find('td') to c.find_all('td') and iterate over that list to get all the columns for each company.
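
That change could look roughly like the sketch below. It runs against a small inline table rather than the live page, so the HTML, company names, and column count here are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the live page: a header row plus two data rows.
html = """
<table id="twpRegionalList" class="sortable-table">
  <tr><td>Company</td><td>Rank</td></tr>
  <tr><td>Example Co</td><td>2</td></tr>
  <tr><td>Sample Inc</td><td>1</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")

# Skip the header row, then collect the text of every <td> in each row.
company_rows = [
    [td.get_text(strip=True) for td in row.find_all("td")]
    for row in rows[1:]
]

print(company_rows)  # [['Example Co', '2'], ['Sample Inc', '1']]
```

Each inner list holds one company's columns in page order, so the name is still at index 0.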

TypeError: 'ResultSet' object is not callable - BeautifulSoup