Web scraping with BS4: selecting by div ID still returns the whole page

Time: 2020-06-16 03:00:14

Tags: python web-scraping beautifulsoup

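I'm trying to scrape content from Wikipedia's Current events page: https://en.wikipedia.org/wiki/Portal:Current_events. Specifically, I want the entries for the current date. Using Inspect Element, I can see that all the information I want is stored in a div with the id "2020_June_15". My script specifies that ID, but it still pulls everything from the page. What am I missing?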

Here is the Python script, wiki.py:

import sys
import requests
import bs4

res = requests.get('https://en.wikipedia.org/wiki/Portal:Current_events')
res.raise_for_status()


soup = bs4.BeautifulSoup(res.text,"lxml")
elems = soup.select('div', {"id": "2020_June_15"})
for i in range(len(elems)):
    print(elems[i].getText())

2 answers:

Answer 0: (score: 0)

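Replace soup.select with soup.find_all. select() reads its first argument as a CSS selector, so 'div' on its own matches every div on the page; the dict is not applied as an attribute filter the way it is in find_all.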

import sys
import requests
import bs4

res = requests.get('https://en.wikipedia.org/wiki/Portal:Current_events')
res.raise_for_status()


soup = bs4.BeautifulSoup(res.text,"lxml")
elems = soup.find_all('div', {"id": "2020_June_15"})
for i in range(len(elems)):
    print(elems[i].getText())
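
To make the difference concrete, here is a minimal sketch on a tiny inline document instead of the live Wikipedia page. The HTML string and the second id ("2020_June_14") are made up for illustration, and "html.parser" is used only so the snippet does not depend on lxml. It also shows that select can do the same job if the ID is written as a CSS attribute selector.

```python
import bs4

# Hypothetical stand-in for the Wikipedia page: two divs, only one has the target id.
HTML = """
<div id="2020_June_14">yesterday</div>
<div id="2020_June_15">today</div>
"""

soup = bs4.BeautifulSoup(HTML, "html.parser")

# select() reads its first argument as a CSS selector, so "div" matches every div.
all_divs = soup.select("div")

# find_all() treats the dict as an attribute filter, so only the matching div is returned.
by_find_all = soup.find_all("div", {"id": "2020_June_15"})

# select() gives the same result when the id is part of the selector itself.
by_select = soup.select('div[id="2020_June_15"]')

print(len(all_divs), len(by_find_all), len(by_select))
print(by_find_all[0].get_text())
```

The attribute form div[id="..."] is used here rather than the div#... shorthand because this ID starts with a digit, which the shorthand does not accept without escaping.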

Answer 1: (score: 0)

You're really close. Instead of select, use find.

import sys
import requests
import bs4

res = requests.get('https://en.wikipedia.org/wiki/Portal:Current_events')
res.raise_for_status()

soup = bs4.BeautifulSoup(res.text,"lxml")
# find() returns a single Tag (or None if nothing matches), not a list
elem = soup.find('div', {"id": "2020_June_15"})
if elem is not None:
    print(elem.getText())
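
One thing to keep in mind with find: it returns a single Tag, or None when no element matches, so there is no list to index into. A rough sketch of how that could be wrapped up is below; the helper name div_text and the inline HTML are hypothetical, only there to keep the example self-contained.

```python
import bs4

# Hypothetical sample markup standing in for the real page.
HTML = '<div id="2020_June_15"><p>Event A</p><p>Event B</p></div>'


def div_text(html, div_id):
    """Return the text of the div with the given id, or None if it is absent."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    elem = soup.find("div", id=div_id)  # a single Tag, or None
    if elem is None:
        return None
    # Join the text fragments with newlines and trim surrounding whitespace.
    return elem.get_text(separator="\n", strip=True)


print(div_text(HTML, "2020_June_15"))
print(div_text(HTML, "2020_June_16"))  # no such div in the sample
```

Checking for None matters on a page like this one, since the ID changes every day and a hard-coded date will eventually stop matching.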