Basic BeautifulSoup Wikipedia scrape

Time: 2016-12-14 21:29:10

Tags: python pandas web-scraping beautifulsoup

I'm trying to get a very basic, short unordered list (<ul>) from Wikipedia. My end goal is to put it into a DataFrame. My question is: where do I start?

In [28]: from bs4 import BeautifulSoup
         import urllib2
         import requests
         from pandas import Series,DataFrame

In [29]: url = "https://en.wikipedia.org/wiki/National_Pro_Grid_League"

In [31]: result = requests.get(url)

In [32]: c = result.content

In [33]: soup = BeautifulSoup(c)

I can't seem to find an answer on StackOverflow, so I'd appreciate any advice anyone can give me.
This is the specific list I'm looking for:

Active teams
Baltimore Anthem (2015–present)
Boston Iron (2014–present)
DC Brawlers (2014–present)
Los Angeles Reign (2014–present)
Miami Surge (2014–present)
New York Rhinos (2014–present)
Phoenix Rise (2014–present)
San Francisco Fire (2014–present)

1 answer:

Answer 0 (score: 3)

First, you need to find the right part of the page. You can do this by finding the heading with id="Active_teams" and then finding the next <ul> element from there.

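from bs4 import BeautifulSoup
import requests

url = "https://en.wikipedia.org/wiki/National_Pro_Grid_League"
r = requests.get(url)
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(r.content, "html.parser")

# Locate the "Active teams" heading, then the first <ul> that follows it
heading = soup.find(id='Active_teams')
teams = heading.find_next('ul')
# Each team is one <li>; get_text() also handles items that mix links and text
for team in teams.find_all('li'):
    print(team.get_text())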
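
Since the end goal in the question is a DataFrame, here is a minimal sketch of one way to get there from the same find / find_next approach. It runs against a small inline HTML snippet rather than the live page, so the markup in SAMPLE_HTML is a simplified assumption about how the section looks, not a copy of the real article. The helper name teams_to_frame, the column names "team" and "years", and the regular expression that splits the name from the parenthesised years are all made up for illustration.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the "Active teams" section of the page.
SAMPLE_HTML = """
<h2><span id="Active_teams">Active teams</span></h2>
<ul>
  <li><a href="/wiki/Baltimore_Anthem">Baltimore Anthem</a> (2015–present)</li>
  <li><a href="/wiki/Boston_Iron">Boston Iron</a> (2014–present)</li>
  <li><a href="/wiki/DC_Brawlers">DC Brawlers</a> (2014–present)</li>
</ul>
"""


def teams_to_frame(html, heading_id="Active_teams"):
    """Return a DataFrame built from the first <ul> after the given heading id."""
    soup = BeautifulSoup(html, "html.parser")

    heading = soup.find(id=heading_id)
    if heading is None:
        raise ValueError("no element with id=%r" % heading_id)

    team_list = heading.find_next("ul")
    if team_list is None:
        raise ValueError("no <ul> found after id=%r" % heading_id)

    rows = []
    for item in team_list.find_all("li"):
        # Join the link text and the trailing "(years)" text into one string.
        text = item.get_text(" ", strip=True)
        # Split "Name (years)" into two parts; keep the whole text if it does not match.
        match = re.match(r"^(.*?)\s*\((.*)\)$", text)
        if match:
            rows.append((match.group(1), match.group(2)))
        else:
            rows.append((text, None))

    return pd.DataFrame(rows, columns=["team", "years"])


if __name__ == "__main__":
    print(teams_to_frame(SAMPLE_HTML))
```

To use this on the real page, passing r.content from the requests call above in place of SAMPLE_HTML should work as long as the heading id and list layout still match; if the page structure has changed, the ValueError makes that visible instead of failing with an AttributeError on None.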