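I'm using BeautifulSoup to parse a list of companies on a VC website. I've found the right elements to iterate over, but I can't seem to get the data on those elements themselves.
Here is a sample of the HTML I'm working with: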
<div id="content" class="site-content">
  <main id="primary" class="content-area" role="main">
    <header class="page-header">
    <main id="portfolio-landing-company-list" class="page-content">
      <section id="portfolio__list--grid" class="portfolio__list--all">
        <div class="company company-stage--venturegrowth company-type--enterprise company--single-company">
          <div class="company__thumbnail company__thumbnail-link">
            <a href="http://www.domain1.com" title="Company1" target="_blank">
          </div>
        </div>
        <div class="company company-stage--seed company-type--bio company--single-company">
          <div class="company__thumbnail company__thumbnail-link">
            <a href="http://www.domain2.com" title="Company2" target="_blank">
          </div>
        </div>
Here is how I'm currently using BeautifulSoup; this part works fine:
portfolio = soup.find('div', attrs={'class': 'portfolio-tiles'})
for eachco in portfolio.find_all('article'):
    companyname = eachco.a['title']
    companyurl = eachco.a['href']
But what I want to do is get the class values from here:
<div class="company company-stage--venturegrowth company-type--enterprise company--single-company">
or
<div class="company company-stage--seed company-type--bio company--single-company">
(each company in the list has its own combination of these classes)
I tried iterating like this:
portfolio = soup.find('div', attrs={'class': 'portfolio-tiles'})
for eachco in portfolio.find_all('article'):
    companyattributes = eachco.div['class']
But that prints the following line:
['company__thumbnail', 'company__thumbnail-link']
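(i.e., one level lower than the one I'm looking for)
How can I iterate over all the results but get the class attribute of each result itself? I have a feeling I'm missing something very basic, but I'd appreciate help figuring out what it is!
Update
I ended up doing the following to get everything working together: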
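import re

# Match the outer company divs by their class attribute
portfolio = soup.find_all('div', class_=re.compile("company company-"))
for eachco in portfolio:
    coname = eachco.a['title']
    courl = eachco.a['href']
    cotypes = eachco['class']   # list of class names on the div itself
    costage = cotypes[1]        # e.g. 'company-stage--venturegrowth'
    comarket = cotypes[2]       # e.g. 'company-type--enterprise'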
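One caveat with indexing into the class list: `cotypes[1]` and `cotypes[2]` only work as long as the classes appear in the same order on every div. As a minimal sketch, assuming the `company-stage--` and `company-type--` prefixes from the sample HTML hold across the whole page, the values could be picked out by prefix instead. The helper name `split_company_classes` is made up for illustration and is not part of BeautifulSoup:

```python
def split_company_classes(classes):
    """Pull the stage and type out of a class list by prefix, not by position."""
    stage = market = None
    for name in classes:
        if name.startswith("company-stage--"):
            stage = name[len("company-stage--"):]
        elif name.startswith("company-type--"):
            market = name[len("company-type--"):]
    return stage, market


# This list mirrors what eachco['class'] holds for the first sample div.
cotypes = ["company", "company-stage--venturegrowth",
           "company-type--enterprise", "company--single-company"]
costage, comarket = split_company_classes(cotypes)
print(costage, comarket)
```

A div that lacks one of the prefixes simply yields `None` for that value rather than raising an `IndexError`.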
Answer 0 (score: 1)
You can use the re module to search for specific text in the class attribute.
from bs4 import BeautifulSoup
import re

html = """<html><div id="content" class="site-content">
<main id="primary" class="content-area" role="main">
<header class="page-header">
<main id="portfolio-landing-company-list" class="page-content">
<section id="portfolio__list--grid" class="portfolio__list--all">
<div class="company company-stage--venturegrowth company-type--enterprise company--single-company">
<div class="company__thumbnail company__thumbnail-link">
<a href="http://www.domain1.com" title="Company1" target="_blank">
</div>
</div>
<div class="company company-stage--venturegrowth company-type--enterprise company--single-company">
<div class="company__thumbnail company__thumbnail-link">
<a href="http://www.domain2.com" title="Company2" target="_blank">
</div>
</div> </html>"""

soup = BeautifulSoup(html, 'html.parser')

# Find every <div> whose class attribute contains "stage"
divs = soup.find_all('div', class_=re.compile("stage"))
for div in divs:
    # div['class'] is the tag's own class attribute, returned as a list
    print(div['class'])
Output:
[u'company', u'company-stage--venturegrowth', u'company-type--enterprise', u'company--single-company']
[u'company', u'company-stage--venturegrowth', u'company-type--enterprise', u'company--single-company']
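As for why the attempt in the question printed the thumbnail classes: `eachco.div` is shorthand for `eachco.find('div')`, so it returns the first `<div>` nested inside the tag, while `eachco['class']` reads the attribute of the tag itself. A small sketch of the difference, using a trimmed-down version of the sample HTML (the `<a>` tags are closed here, which is my own simplification):

```python
from bs4 import BeautifulSoup

# Simplified from the sample HTML in the question; the <a> tag is closed
# so the result does not depend on how the parser repairs the markup.
html = """
<section id="portfolio__list--grid" class="portfolio__list--all">
  <div class="company company-stage--seed company-type--bio company--single-company">
    <div class="company__thumbnail company__thumbnail-link">
      <a href="http://www.domain2.com" title="Company2" target="_blank"></a>
    </div>
  </div>
</section>
"""

soup = BeautifulSoup(html, "html.parser")

# class_="company" matches a div that has "company" as one of its class names
outer = soup.find("div", class_="company")

own_classes = outer["class"]        # attribute of the tag itself
child_classes = outer.div["class"]  # attribute of the first <div> inside it

print(own_classes)
print(child_classes)
```

So once the loop variable is the company div itself, indexing it directly with `['class']` gives the list you were after.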
Answer 1 (score: 1)
I think this is what you're looking for:
for div in soup.select('div[class*="stage"]'):
    print(div.attrs['class'])
Output:
['company', 'company-stage--venturegrowth', 'company-type--enterprise', 'company--single-company']
['company', 'company-stage--seed', 'company-type--bio', 'company--single-company']
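One thing to be aware of: `[class*="stage"]` is a substring match on the whole class attribute, so it would also pick up any unrelated div whose class merely contains "stage". If that is a concern, the class selector `div.company` matches only divs that have `company` as a complete class name. A rough sketch of the difference; the `backstage-pass` div is invented purely to illustrate and is not from the original page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the "backstage-pass" div is made up for this example.
html = """
<div class="company company-stage--seed company-type--bio">
  <a href="http://www.domain2.com" title="Company2"></a>
</div>
<div class="backstage-pass">not a company</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Substring match anywhere in the class attribute
substring_hits = [div["class"] for div in soup.select('div[class*="stage"]')]

# Match on the complete class name "company"
token_hits = [div["class"] for div in soup.select("div.company")]

print(len(substring_hits), len(token_hits))
```

On the sample HTML from the question both selectors return the same two company divs, so this only matters if the page has other classes containing "stage".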