Cleaning text with Beautiful Soup

Time: 2015-06-22 15:01:06

Tags: python beautifulsoup wikipedia

OK, so I'm processing an HTML file with Beautiful Soup, and this is what I've done so far:

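# The post appears to use Python 2; on Python 3, use `from urllib.request import urlopen`.
from urllib2 import urlopen
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/" + "Category:American_football"
raw = urlopen(url).read()
soup = BeautifulSoup(raw, "html.parser")  # name the parser explicitly
pages = soup.find("div", {"id": "mw-subcategories"})
cleaned = pages.get_text()
cleaned = cleaned.encode("utf-8")  # UTF-8 byte string, hence the \x.. escapes below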

My output looks like this:

"\nSubcategories\nThis category has the following 26 subcategories, out of 26 total.\n\xc2\xa0\n\xe2\x96\xba  American football by city\xe2\x80\x8e (5 C)\n\n\n\xe2\x96\xba  American football by continent\xe2\x80\x8e (6 C)\n\n\n\xe2\x96\xba  American football by country\xe2\x80\x8e (41 C, 1 P)\n\n*\n\xe2\x96\xba  American football-related lists\xe2\x80\x8e (6 C, 16 P)\n\nA\n\xe2\x96\xba  American football occupations\xe2\x80\x8e (2 C, 6 P)\n\nC\n\xe2\x96\xba  American football competitions\xe2\x80\x8e (15 C, 13 P)\n\nE\n\xe2\x96\xba  American football equipment\xe2\x80\x8e (16 P)\n\nH\n\xe2\x96\xba  History of American football\xe2\x80\x8e (8 C, 14 P)\n\nI\n\xe2\x96\xba  American football incidents\xe2\x80\x8e (1 C, 45 P)\n\nM\n\xe2\x96\xba  American football media\xe2\x80\x8e (12 C, 16 P)\n\nO\n\xe2\x96\xba  American football organisations\xe2\x80\x8e (1 C, 7 P)\n\nP\n\xe2\x96\xba  American football people\xe2\x80\x8e (11 C)\n\n\n\xe2\x96\xba  American football plays\xe2\x80\x8e (68 P)\n\n\n\xe2\x96\xba  American football positions\xe2\x80\x8e (1 C, 41 P)\n\nR\n\xe2\x96\xba  American football records and statistics\xe2\x80\x8e (4 C, 8 P)\n\nS\n\xe2\x96\xba  Seasons in American football\xe2\x80\x8e (14 C)\n\n\n\xe2\x96\xba  Semi-professional American football\xe2\x80\x8e (1 C, 9 P)\n\n\n\xe2\x96\xba  American football strategy\xe2\x80\x8e (1 C, 29 P)\n\nT\n\xe2\x96\xba  American football teams\xe2\x80\x8e (10 C, 10 P)\n\n\n\xe2\x96\xba  American football terminology\xe2\x80\x8e (4 C, 127 P)\n\n\n\xe2\x96\xba  American football trophies and awards\xe2\x80\x8e (9 C, 26 P)\n\nV\n\xe2\x96\xba  Variations of American football\xe2\x80\x8e (5 C, 12 P)\n\n\n\xe2\x96\xba  American football venues\xe2\x80\x8e (2 C, 2 P)\n\nW\n\xe2\x96\xba  Women's American football\xe2\x80\x8e (3 C, 3 P)\n\n\xce\x99\n\xe2\x96\xba  American football logos\xe2\x80\x8e (3 C, 211 F)\n\n\xce\xa3\n\xe2\x96\xba  American football stubs\xe2\x80\x8e (6 C, 218 P)\n\n\n"

I'm trying to strip out everything except the actual category names, i.e. remove fragments such as:

\xe2\x80\x8e (6 C, 218 P)\n\n\n

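Is there a trick for getting rid of this with the Beautiful Soup library, or how else should I refine the text further?

1 answer:

Answer 0: (score: 1)

Navigate to the <a> elements you want: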

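import bs4

soup = bs4.BeautifulSoup(raw, "html.parser")
# Each subcategory name is the text of an <a class="CategoryTreeLabel"> link
for cat in soup.find_all("a", {"class": "CategoryTreeLabel"}):
    print(cat.text)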

Output:

American football by city
American football by continent
American football by country
American football-related lists
American football occupations
American football competitions
American football equipment
History of American football
American football incidents
American football media
American football organisations
American football people
American football plays
American football positions
American football records and statistics
Seasons in American football
Semi-professional American football
American football strategy
American football teams
American football terminology
American football trophies and awards
Variations of American football
American football venues
Women's American football
American football logos
American football stubs
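
If only the flattened text from get_text() is at hand, a regular expression can serve as a fallback. The following is a minimal sketch, not part of the original answer. It assumes Python 3 and the Unicode string returned by get_text() (before the .encode('utf-8') step), in which the bytes \xe2\x96\xba decode to the "►" marker (U+25BA) and \xe2\x80\x8e to the invisible left-to-right mark (U+200E). The names LINE, extract_names and sample are hypothetical, chosen only for illustration.

```python
import re

# Matches one subcategory line of the flattened text: a leading "►" marker,
# the name, an optional left-to-right mark (U+200E), then the "(n C, m P)" counts.
LINE = re.compile(r"^\u25ba\s*(?P<name>.*?)\u200e?\s*\([^()]*\)\s*$")


def extract_names(text):
    """Return the category names found in text produced by get_text()."""
    names = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:  # headings, index letters and blank lines do not match
            names.append(match.group("name").strip())
    return names


# A shortened sample in the shape of the question's output (assumed, decoded form).
sample = (
    "\nSubcategories\nThis category has the following 26 subcategories, "
    "out of 26 total.\n\u00a0\n"
    "\u25ba  American football by city\u200e (5 C)\n\n\n"
    "\u25ba  American football-related lists\u200e (6 C, 16 P)\n\nA\n"
    "\u25ba  Women's American football\u200e (3 C, 3 P)\n\n"
)

print(extract_names(sample))
```

Selecting the <a> elements directly, as in the answer, is usually the more robust route, since this pattern depends on the exact marker characters Wikipedia happens to emit.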