Parsing HTML with BeautifulSoup4

Time: 2014-08-07 21:11:10

Tags: python html parsing beautifulsoup

I need to get the names of all the high schools listed on this page, along with their cities, using BeautifulSoup4. My non-working code is below. Many thanks.

http://en.wikipedia.org/wiki/List_of_high_schools_in_Texas

import urllib2
bs4 import BeautifulSoup

opener = urllib2.build_opener()
opener.addheaders = [('User-again','Mozilla/5.0' ) ]

url = ("http://en.wikipedia.org/wiki/List_of_high_schools_in_Texas")

ourUrl = opener.open(url).read()

soup = BeautifulSoup(ourUrl)

print get_text(soup.find_all('il')) 

![html](http://i1074.photobucket.com/albums/w402/phillipjones2/Screenshot2014-08-07at53445PM_zpsebe195cb.png)

1 Answer:

Answer 0 (score: 1)

There are quite a few errors in your program. Below is a working version that you can use as a basis for further refinement.

import requests # much better than using urllib2
from bs4 import BeautifulSoup # you forgot the `from`

url = "http://en.wikipedia.org/wiki/List_of_high_schools_in_Texas" 
# you don't need () around it
r = requests.get(url) 
# does everything all at once, no need to call `opener` and `read()`
contents = r.text # get the HTML contents of the page

soup = BeautifulSoup(contents, "html.parser")  # name the parser explicitly
for item in soup.find_all('li'): # 'li' and 'il' are different things...
    print(item.get_text())       # you need to iterate over all the elements
                                 # found by `find_all()`

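That's it. This gives you the text of every <li>...</li> item on the page. As you will see when you run the program, there are many irrelevant results, such as the table of contents, the menu items on the left, the footer, and so on. I'll leave it to you to figure out how to get only the school names, and how to separate them from the county names and other entries.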
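As one possible starting point for that narrowing step, here is a minimal sketch. It rests on assumptions you should verify against the real page: that the school lists sit inside a container with the id `mw-content-text`, and that each entry reads "School name, City". The sample HTML, the school names, and the helper `extract_schools` are made up for illustration and are not taken from the actual page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the page's HTML. The real markup
# may differ, so inspect the page before relying on this layout.
SAMPLE_HTML = """
<div id="mw-navigation"><ul><li>Main page</li><li>Contents</li></ul></div>
<div id="mw-content-text">
  <h2>Example County</h2>
  <ul>
    <li><a href="/wiki/Example_High_School">Example High School</a>, Exampleville</li>
    <li><a href="/wiki/Sample_High_School">Sample High School</a>, Sampleton</li>
    <li>See also</li>
  </ul>
</div>
<div id="footer"><ul><li>Privacy policy</li></ul></div>
"""


def extract_schools(html, container_id="mw-content-text"):
    """Return (name, city) pairs from <li> items inside one container.

    Both the container id and the "Name, City" entry format are
    assumptions made for this sketch.
    """
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(id=container_id)
    if container is None:
        return []  # the assumed container is not on this page

    schools = []
    for item in container.find_all("li"):
        text = item.get_text().strip()
        # Split on the last comma so a comma inside the name survives.
        name, sep, city = text.rpartition(",")
        if not sep:
            continue  # no comma: not in the assumed format, skip it
        schools.append((name.strip(), city.strip()))
    return schools


for name, city in extract_schools(SAMPLE_HTML):
    print(name + " | " + city)
```

Restricting the search to one container drops the menu and footer items in a single step. Skipping entries without a comma is a crude filter, so expect to adjust it once you have looked at the real entries.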

For reference, read the BS docs carefully. They will answer a lot of your questions.