我正在尝试从 this website 中抓取一些数据。 我将不胜感激。
每页有 30 个条目,我目前正在尝试从每页上的每个链接中抓取信息。这是我的代码:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
driver = webdriver.Chrome()
print(driver.title)
driver.get("https://www.businesslist.com.ng/category/farming")
time.sleep(1)
select = driver.find_element_by_id("listings")
page_entries = [i.find_element_by_tag_name("a").get_attribute("href")
for i in select.find_elements_by_tag_name("h4")]
columns = {"ESTABLISHMENT YEAR":[], "EMPLOYEES":[], "COMPANY MANAGER":[],
"VAT REGISTRATION":[], "REGISTRATION CODE":[]}
for i in page_entries:
print(i)
driver.get(i)
listify_subentries = [i.text.strip().replace("\n","") for i in
driver.find_elements_by_class_name("info")][:11]
到这里为止一切正常。问题可能出在下面的部分。
for i in listify_subentries:
for q in columns.keys():
if q in i:
item = i.replace(q,"")
print(item)
columns[q].append(item)
else:
columns[q].append("None given")
print("None given")
Here's a picture of the layout for one entry. Sorry I can't yet embed images.
我正在尝试从每个企业的页面上抓取“工作时间”框下的一些信息(即成立年份、公司经理等)。您可以在 columns
变量下找到确切信息。
因为并非所有页面的“工作时间”框(here is one with more details underneath it)下的信息量都相同,我尝试使用字典+文本操作来查找可用的子条目并获取相关信息以他们的权利。也就是说,获取公司经理的姓名、成立年份等;如果一个页面没有这个,那么它只会在相关的子条目下被标记为“没有给出”。
这个想法是整理所有这些信息,然后将其导出到数据帧。当页面缺少特定子条目时输入“无给定”允许我保持数据结构的完整性,从而确保条目对齐。
但是,当我运行代码时,我收到的输出完全关闭。
Here is the outer view of the columns dictionary once the code has run.
And if I click on the 'COMPANY MANAGER' section, you can see that there are multiple instances of it saying "None given" before it gives the name of company manager on the page. 对每个其他子条目重复此操作,如果您运行代码并向下滚动,您将看到。我不确定出了什么问题,但似乎列表的大小已经膨胀了 10 倍,到处都是额外的“没有给出”。每个列表的大小应该是 30,但现在是 330。
我将不胜感激。谢谢。
答案 0 :(得分:0)
您可以使用下一个示例如何迭代该页面上的所有企业并将各种信息保存到数据框中:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://www.businesslist.com.ng/category/farming"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
all_data = []
for a in soup.select("h4 > a"):
u = "https://www.businesslist.com.ng" + a["href"]
print(u)
data = {"URL": u}
s = BeautifulSoup(requests.get(u).content, "html.parser")
for info in s.select("div.info:has(.label)"):
label = info.select_one(".label")
label.extract()
value = info.get_text(strip=True, separator=" ")
data[label.get_text(strip=True)] = value
all_data.append(data)
df = pd.DataFrame(all_data)
print(df)
df.to_csv("data.csv", index=None)
打印:
URL Company name Address Phone Number Mobile phone Working hours Establishment year Employees Company manager Share this listing Location map Description Products & Services Listed in categories Keywords Website Registration code VAT registration E-mail Fax
0 https://www.businesslist.com.ng/company/198846/macmed-integrated-farms Macmed Integrated Farms 1, Gani Street, Ijegun Imore. Satellite Town. Lagos , Lagos , Lagos , Nigeria View Map 08033316905 09092245349 Monday: 8am-5pm Tuesday: 8am-5pm Wednesday: 8am-5pm Thursday: 8am-5pm Friday: 8am-5pm Saturday: 8am-5pm Sunday: 10am-4pm 2013 1-5 Engr. Marcus Awoh Show Map Expand Map 1, Gani Street, Ijegun Imore. Satellite Town. Lagos , Lagos , Lagos , Nigeria Get Directions Macmed Integrated Farms is into Poultry, Fish Farming (eggs, meat,day old chicks,fingerlings and grow-out) and animal Husbandry and sales of Farmlands land and facilities We also provide feasibility studies and business planning for all kind of businesses. Day old chicks WE are receiving large quantity of Day old Pullets, Broilers and cockerel in December 2016.\nInterested buyers are invited. PRICE: N 100 - N 350 Investors/ Partners We Macmed Integrated Farms a subsidiary of Macmed Cafe Limited RC (621444) are into poultry farming situated at Iponsinyi, behind (Nigerian National Petroleum Marketing Company)NNPMC at Mosimi, along... Commercial Hatchery Macmed Integrated Farms is setting up a Hatchery for chicken and other birds. We have 2 nos of fully automatic incubator imported from China with combined capacity of 1,500 eggs per setting.\nPlease book in advance.\nMarcus Awoh.\nfarm Operations Manager. PRICE: N100 - N250 Business Services Business Services / Consultants Business Services / Small Business Business Services / Small Business / Business Plans Business Services / Animal Shelters Manufacturing & Industry Manufacturing & Industry / Farming Manufacturing & Industry / Farming / Poultry Housing Suppliers Catfish Day old chicks Farming FINGERLINGS Fishery grow out and aquaculture Meat Poultry eggs spent Pol MORE +4 NaN NaN NaN NaN NaN
...
并保存 data.csv
(来自 LibreOffice 的屏幕截图):