I've been experimenting with scraping a website using BeautifulSoup. The structure looks like this:
<div class="content">
<div class="cf-listing"><time></div>
<a class="post-title" href="http://example.com">This is example</a>
<div class="cf-listing"><time></div>
<a class="post-title" href="http://example.com">This is example</a>
.....
</div>
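This is what I get after calling soup.find_all("div", class_="content").
I created three empty lists named "time", "post-title", and "url", and I want to append the time, a.text, and a.href to them.
I'm trying to use a for-each loop, but I'm not sure how to target specific elements with BeautifulSoup.
My code: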
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get("https://www.example.com")
content = driver.page_source
soup = BeautifulSoup(content, "html.parser")  # name the parser explicitly to avoid the "no parser specified" warning
one = soup.find_all("div", class_="content")
Thanks.
Answer 0 (score: 0)
You can use soup.select() or soup.find_all() to select every <div class="cf-listing">, then call .find_next('a') on each one to get the <a> tag that follows it.
For example:
from bs4 import BeautifulSoup
txt = '''<div class="content">
<div class="cf-listing">This is time 1</div>
<a class="post-title" href="http://example1.com">This is example 1</a>
<div class="cf-listing">This is time 2</div>
<a class="post-title" href="http://example2.com">This is example 2</a>
</div>'''
soup = BeautifulSoup(txt, 'html.parser')
time_lst, href_lst, title_lst = [], [], []
for div in soup.select('.cf-listing'):
    time_lst.append(div.get_text(strip=True))
    href_lst.append(div.find_next('a')['href'])
    title_lst.append(div.find_next('a').get_text(strip=True))
# print it to screen:
for t, href, title in zip(time_lst, href_lst, title_lst):
    print('{:<30}{:<30}{}'.format(t, href, title))
Prints:
This is time 1                http://example1.com           This is example 1
This is time 2                http://example2.com           This is example 2
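The answer also mentions soup.find_all() as an alternative to soup.select(). Here is a minimal sketch of that variant, reusing the same sample HTML. The list names (times, titles, urls), the container variable, and the None check are illustrative assumptions rather than part of the original answer. It limits the search to the outer <div class="content"> from the question and calls find_next() only once per listing.

```python
from bs4 import BeautifulSoup

# Same sample markup as in the answer above.
html = '''<div class="content">
<div class="cf-listing">This is time 1</div>
<a class="post-title" href="http://example1.com">This is example 1</a>
<div class="cf-listing">This is time 2</div>
<a class="post-title" href="http://example2.com">This is example 2</a>
</div>'''

soup = BeautifulSoup(html, "html.parser")

# Hypothetical list names, chosen to match the three lists in the question.
times, titles, urls = [], [], []

# Restrict the search to the outer container, as the question does.
container = soup.find("div", class_="content")

for div in container.find_all("div", class_="cf-listing"):
    # Look up the following post-title link once and reuse it.
    link = div.find_next("a", class_="post-title")
    if link is None:
        continue  # assumption: skip a listing that has no link after it
    times.append(div.get_text(strip=True))
    titles.append(link.get_text(strip=True))
    urls.append(link.get("href"))

for t, title, url in zip(times, titles, urls):
    print(t, title, url)
```

With Selenium, the same loop should work if html is replaced by driver.page_source, provided the page has finished loading before the source is read.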