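I am trying to scrape a web page and store the results in a CSV/Excel file. I am using Beautiful Soup for this.
I am trying to extract data from the soup with the find_all function, but I am not sure how to capture the data in the field name or title.
The HTML has the following format: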
<h3 class="font20">
<span itemprop="position">36.</span>
<a class="font20 c_name_head weight700 detail_page"
href="/companies/view/1033/nimblechapps-pvt-ltd" target="_blank"
title="Nimblechapps Pvt. Ltd.">
<span itemprop="name">Nimblechapps Pvt. Ltd. </span>
</a> </h3>
This is my code so far. I am not sure how to proceed from here.
from bs4 import BeautifulSoup as BS
import requests
page = 'https://www.goodfirms.co/directory/platform/app-development/iphone?page=2'
res = requests.get(page)
cont = BS(res.content, "html.parser")
names = cont.find_all(class_ = 'font20 c_name_head weight700 detail_page')
names = cont.find_all('a', attrs={'class': 'font20 c_name_head weight700 detail_page'})
I tried the following:
Input: cont.h3.a.span
Output: <span itemprop="name">Nimblechapps Pvt. Ltd.</span>
I want to extract the name of the company: "Nimblechapps Pvt. Ltd."
Answer 0 (score: 2)
You can use a list comprehension for this:
from bs4 import BeautifulSoup as BS
import requests
page = 'https://www.goodfirms.co/directory/platform/app-development/iphone?page=2'
res = requests.get(page)
cont = BS(res.content, "html.parser")
names = cont.find_all('a' , attrs = {'class':'font20 c_name_head weight700 detail_page'})
print([n.text for n in names])
You will get:
['Nimblechapps Pvt. Ltd.', (..) , 'InnoApps Technologies Pvt. Ltd', 'Umbrella IT', 'iQlance Solutions', 'getyoteam', 'JetRuby Agency LTD.', 'ONLINICO', 'Dedicated Developers', 'Appingine', 'webnexs']
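Since the question also asks about saving the results to a CSV file, here is a minimal sketch of that step. It parses an inline HTML string instead of the live page so it runs without network access. The second company entry ("Example Co"), the helper names extract_rows and write_csv, and the column names are made up for illustration and are not from the site or the question.

```python
import csv
import io

from bs4 import BeautifulSoup

# Inline sample modelled on the markup in the question.
# The second entry ("Example Co") is hypothetical, added only to show several rows.
html = '''
<h3 class="font20">
<a class="font20 c_name_head weight700 detail_page"
   href="/companies/view/1033/nimblechapps-pvt-ltd"
   title="Nimblechapps Pvt. Ltd.">
<span itemprop="name">Nimblechapps Pvt. Ltd. </span>
</a> </h3>
<h3 class="font20">
<a class="font20 c_name_head weight700 detail_page"
   href="/companies/view/example" title="Example Co">
<span itemprop="name">Example Co</span>
</a> </h3>
'''


def extract_rows(markup):
    """Return a list of (name, href) tuples, one per company link."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for link in soup.find_all("a", class_="detail_page"):
        # get_text(strip=True) drops the surrounding newlines and spaces.
        rows.append((link.get_text(strip=True), link["href"]))
    return rows


def write_csv(rows, fileobj):
    """Write a header line followed by one CSV line per row."""
    writer = csv.writer(fileobj)
    writer.writerow(["name", "href"])
    writer.writerows(rows)


rows = extract_rows(html)

# Write to an in-memory buffer here; a real file object works the same way.
buffer = io.StringIO()
write_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with open("companies.csv", "w", newline="", encoding="utf-8"), where the file name is only an example. The newline="" argument is what the csv module documentation recommends to avoid blank lines on Windows.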
Answer 1 (score: 1)
The same, but using the descendant combinator " " to combine the type selector a with the attribute = value selector [itemprop="name"]:
names = [item.text for item in cont.select('a [itemprop="name"]')]
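To see what the descendant combinator does here, the sketch below runs the same selector against the HTML snippet from the question, pasted inline so no request is needed. It assumes the real page follows the same structure as that snippet.

```python
from bs4 import BeautifulSoup

# The snippet from the question, inlined so the example runs offline.
html = '''
<h3 class="font20">
<span itemprop="position">36.</span>
<a class="font20 c_name_head weight700 detail_page"
   href="/companies/view/1033/nimblechapps-pvt-ltd" target="_blank"
   title="Nimblechapps Pvt. Ltd.">
<span itemprop="name">Nimblechapps Pvt. Ltd. </span>
</a> </h3>
'''

cont = BeautifulSoup(html, "html.parser")

# 'a [itemprop="name"]' matches elements with itemprop="name"
# that are nested anywhere inside an <a> element.
names = [item.get_text(strip=True) for item in cont.select('a [itemprop="name"]')]

# The position span sits next to the <a>, not inside it, so this finds nothing.
positions_in_links = cont.select('a [itemprop="position"]')

print(names)
print(positions_in_links)
```

Using get_text(strip=True) rather than .text is a small variation that removes the trailing space inside the span.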
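Answer 2 (score: 1)
Try not to use compound classes in your script, as they are prone to breaking. The following script should also fetch the content you need.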
import requests
from bs4 import BeautifulSoup
link = "https://www.goodfirms.co/directory/platform/app-development/iphone?page=2"
res = requests.get(link)
soup = BeautifulSoup(res.text, 'html.parser')
for items in soup.find_all(class_="commoncompanydetail"):
    names = items.find(class_='detail_page').text
    print(names)
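A small sketch of why compound classes are fragile: when the class filter is a string containing several class names, Beautiful Soup compares it against the exact value of the class attribute, so a change in class order breaks the match, whereas a single class name still matches. The reordered markup below is hypothetical and only illustrates the failure mode.

```python
from bs4 import BeautifulSoup

compound = "font20 c_name_head weight700 detail_page"

# Markup as it appears in the question.
original = BeautifulSoup(
    '<a class="font20 c_name_head weight700 detail_page">Nimblechapps Pvt. Ltd.</a>',
    "html.parser",
)

# Hypothetical variant: same classes, different order.
reordered = BeautifulSoup(
    '<a class="detail_page font20 c_name_head weight700">Nimblechapps Pvt. Ltd.</a>',
    "html.parser",
)

# The compound string only matches the exact attribute value.
compound_hits_original = len(original.find_all("a", class_=compound))
compound_hits_reordered = len(reordered.find_all("a", class_=compound))

# A single class name matches regardless of the other classes or their order.
single_hits_original = len(original.find_all("a", class_="detail_page"))
single_hits_reordered = len(reordered.find_all("a", class_="detail_page"))

print(compound_hits_original, compound_hits_reordered)
print(single_hits_original, single_hits_reordered)
```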