Why is the value "External_links" instead of the items scraped from the website?

Time: 2018-07-22 03:06:22

Tags: python web-scraping beautifulsoup urllib

My code is below, but why does the brand variable end up as "External_links" instead of the list of items I extracted?

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq


my_url = 'https://en.wikipedia.org/wiki/Harry_Potter'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html,"html.parser")
headline = page_soup.findAll("span",{"class":"mw-headline"})

for item in headline:
    brand = item["id"] # Outputs "External_links"

3 answers:

Answer 0: (score: 1)

In your for loop, you iterate over every headline on the page and assign each headline's value to the variable brand. After the loop finishes, brand holds only the last headline ("External_links").

If you change the code to print the value of each headline, you will see that you are getting the values you want:

>>> for item in headline:
...    print(item["id"])
...
Plot
Early_years
Voldemort_returns
Supplementary_works
Harry_Potter_and_the_Cursed_Child
In-universe_books
Pottermore_website
Structure_and_genre
Themes
Origins
Publishing_history
Translations
Completion_of_the_series
Cover_art
Achievements
Cultural_impact
Commercial_success
Awards,_honours,_and_recognition
Reception
Literary_criticism
Social_impact
Controversies
Adaptations
Films
Spin-off_prequels
Games
Audiobooks
Stage_production
Attractions
The_Wizarding_World_of_Harry_Potter
The_Making_of_Harry_Potter
References
Further_reading
External_links
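To keep every headline rather than only the last one, the loop can accumulate values into a list instead of reassigning the same variable on each pass. A minimal sketch with plain Python (the sample ids stand in for the scraped headlines):

```python
# Sample ids standing in for the scraped headline values
headlines = ["Plot", "Themes", "External_links"]

# Reassigning inside the loop keeps only the final value
brand = None
for item in headlines:
    brand = item
print(brand)  # -> External_links

# Appending collects every value
brands = []
for item in headlines:
    brands.append(item)
print(brands)  # -> ['Plot', 'Themes', 'External_links']
```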

Answer 1: (score: 0)

Your brand variable needs to be a list. For example, the code could look like this:

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
from pprint import pprint

my_url = 'https://en.wikipedia.org/wiki/Harry_Potter'
with uReq(my_url) as uClient:
    page_html = uClient.read()
    page_soup = soup(page_html, "html.parser")

brand = []
for item in page_soup.find_all('span', {'class': 'mw-headline'}):
    brand.append(item["id"])

pprint(brand)

Answer 2: (score: 0)

Using a list comprehension to achieve the same result:

import requests
from bs4 import BeautifulSoup
from pprint import pprint

url = 'https://en.wikipedia.org/wiki/Harry_Potter'

soup = BeautifulSoup(requests.get(url).text, "lxml")
items = [item.get('id') for item in soup.find_all('span',class_='mw-headline')]
pprint(items)
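The same find_all call can also be written as a CSS selector via select(). A small self-contained sketch, using an inline HTML snippet (a made-up stand-in for the live Wikipedia page) so it runs without a network request:

```python
from bs4 import BeautifulSoup

# Tiny inline HTML sample mimicking the page's headline spans
html = ('<span class="mw-headline" id="Plot"></span>'
        '<span class="mw-headline" id="Themes"></span>')
soup = BeautifulSoup(html, "html.parser")

# select('span.mw-headline') matches the same elements as
# find_all('span', class_='mw-headline')
items = [item.get('id') for item in soup.select('span.mw-headline')]
print(items)  # -> ['Plot', 'Themes']
```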