BeautifulSoup: how to get different items from a <div>

Time: 2020-06-13 12:50:16

Tags: python beautifulsoup

I have been experimenting with BeautifulSoup on a website. The structure looks like this:

<div class="content">
    <div class="cf-listing"><time></div>
    <a class="post-title" href="http://example.com">This is example</a>
    <div class="cf-listing"><time></div>
    <a class="post-title" href="http://example.com">This is example</a>
    .....
</div>

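This is what I get after calling soup.find_all("div", class_="content"). I created three empty lists named "time", "post-title" and "url", and I want to append the matching values to them: the time text, a.text and a.href.

I am trying to use a for-each loop, but I am not sure how to target specific items with BeautifulSoup.

My code: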

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://www.example.com")
content = driver.page_source
soup = BeautifulSoup(content, "html.parser")  # name the parser explicitly to avoid the "no parser specified" warning
one = soup.find_all("div", class_="content")

Thanks.

1 answer:

Answer 0: (score: 0)

You can use soup.select() or soup.find_all() to select every <div class="cf-listing">, then call .find_next('a') on each one to get the <a> tag that follows it.

For example:

from bs4 import BeautifulSoup

txt = '''<div class="content">
    <div class="cf-listing">This is time 1</div>
    <a class="post-title" href="http://example1.com">This is example 1</a>
    <div class="cf-listing">This is time 2</div>
    <a class="post-title" href="http://example2.com">This is example 2</a>
</div>'''

soup = BeautifulSoup(txt, 'html.parser')

time_lst, href_lst, title_lst = [], [] , []
for div in soup.select('.cf-listing'):
    time_lst.append( div.get_text(strip=True) )
    href_lst.append( div.find_next('a')['href'] )
    title_lst.append( div.find_next('a').get_text(strip=True) )

# print it to screen:
for t, href, title in zip(time_lst, href_lst, title_lst):
    print('{:<30}{:<30}{}'.format(t, href, title))

Prints:

This is time 1                http://example1.com           This is example 1
This is time 2                http://example2.com           This is example 2
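
The answer mentions soup.find_all() as an alternative to soup.select() but only demonstrates the latter. Below is a minimal sketch of the find_all() variant, starting from the same soup.find_all("div", class_="content") call used in the question. The helper name extract_posts and the third, link-less listing in the sample markup are assumptions added for illustration; they are not part of the original page.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the answer's example. The third listing has no
# following <a> tag; it is a hypothetical case added to show the guard below.
html = '''<div class="content">
    <div class="cf-listing">This is time 1</div>
    <a class="post-title" href="http://example1.com">This is example 1</a>
    <div class="cf-listing">This is time 2</div>
    <a class="post-title" href="http://example2.com">This is example 2</a>
    <div class="cf-listing">This is time 3</div>
</div>'''


def extract_posts(markup):
    """Return three parallel lists: times, titles and urls (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    times, titles, urls = [], [], []

    # find_all() returns a list-like ResultSet, so iterate over each container.
    for content in soup.find_all("div", class_="content"):
        for div in content.find_all("div", class_="cf-listing"):
            # find_next() looks forward through the rest of the document and
            # returns None when no matching tag follows this div.
            link = div.find_next("a", class_="post-title")
            if link is None:
                continue  # skip a listing that has no link after it
            times.append(div.get_text(strip=True))
            titles.append(link.get_text(strip=True))
            urls.append(link.get("href"))
    return times, titles, urls


times, titles, urls = extract_posts(html)
for t, title, url in zip(times, titles, urls):
    print(t, title, url)
```

Note that find_next() is not limited to the immediate sibling: if a listing in the middle of the page had no link of its own, it would pick up the next post's link instead, so the None check only covers a missing link at the end.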