I'm struggling with a list of URLs to pull data from. I tried this code to get the data from a single URL:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.horizont.net/marketing/nachrichten/anzeige.-digitalisierung-wie-software-die-kreativitaet-steigert-178413')
c = r.content
soup = BeautifulSoup(c, 'html.parser')
print(soup.prettify())
and then simply pull out the pieces I need from the page:
articles = soup.select('.PageArticle')  # 'all' shadows a built-in, so use a clearer name

title = []
author = []
publish_date = []
article_main_content = []
article_body = []

for item in articles:
    t = item.find_all('h1')[0].text
    title.append(t)
    # The span indices (2 for the author, 5 for the date) are tied to this page's markup.
    a = item.find_all('span')[2].text
    author.append(a)
    p = item.find_all('span')[5].text
    publish_date.append(p)
    amc = item.select('.PageArticle_lead-content')[0].text
    article_main_content.append(amc)
    a_body = item.select('.PageArticle_body')[0].text
    article_body.append(a_body)
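The span-index lookups above (`[2]` for the author, `[5]` for the date) depend entirely on the page's markup. A minimal sketch with an inline HTML snippet (hypothetical markup mimicking the `.PageArticle` structure, not the real horizont.net page) shows how the same pattern behaves:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page may order its <span> tags differently.
html = """
<div class="PageArticle">
  <h1>Sample headline</h1>
  <span>s0</span><span>s1</span><span>Jane Doe</span>
  <span>s3</span><span>s4</span><span>1. Januar 2020</span>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
item = soup.select('.PageArticle')[0]
print(item.find_all('h1')[0].text)    # Sample headline
print(item.find_all('span')[2].text)  # Jane Doe
print(item.find_all('span')[5].text)  # 1. Januar 2020
```

If the site ever reorders those spans, the positional indices silently pick up the wrong text, which is why class-based `select()` calls (as used for the lead and body) are the more robust choice.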
Now I have a list of URLs and would like to get the same results for each one by looping over the list... any idea how? At the moment I get the following output:
Schweizer Illustrierte und L'illustré rücken näher zusammen
Beat Hürlimann
11. November 2019
but I need the same result (article name, author and publish date) for every URL.
Answer 0 (score: 0)
After some struggling, I was able to find a solution:
import requests
from bs4 import BeautifulSoup

title = []
author = []
publish_date = []

for url in url_list:
    r = requests.get(url)
    c = r.content
    soup = BeautifulSoup(c, 'html.parser')
    articles = soup.select('.PageArticle')
    for item in articles:
        # Use the same fixed indices as for the single page; the original
        # i/j/k/l counters were never initialized and would drift off the
        # right spans after the first article.
        t = item.find_all('h1')[0].text
        title.append(t)
        print(t)
        a = item.find_all('span')[2].text
        author.append(a)
        print(a)
        p = item.find_all('span')[5].text
        publish_date.append(p)
        print(p)
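Once the loop has filled the three parallel lists, they can be combined into one row per article, for example to write a CSV file. A minimal standard-library sketch (the sample values below stand in for the scraped results):

```python
import csv
import io

# Sample values standing in for the scraped results.
title = ["Schweizer Illustrierte und L'illustré rücken näher zusammen"]
author = ["Beat Hürlimann"]
publish_date = ["11. November 2019"]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["title", "author", "publish_date"])
# zip() pairs the i-th entry of each list into one row per article.
for row in zip(title, author, publish_date):
    writer.writerow(row)

print(buf.getvalue())
```

Replacing `io.StringIO()` with `open('articles.csv', 'w', newline='', encoding='utf-8')` writes the same rows to disk.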