好的,所以我正在使用漂亮的汤处理html文件,我做了以下事情:
url = "https://en.wikipedia.org/wiki/"+'Category:American_football'
raw = urlopen(url).read()
soup=BeautifulSoup(raw)
pages = soup.find("div" , { "id" : "mw-subcategories" })
cleaned = pages.get_text()
cleaned = cleaned.encode('utf-8')
我的输出如下所示:
"\nSubcategories\nThis category has the following 26 subcategories, out of 26 total.\n\xc2\xa0\n\xe2\x96\xba American football by city\xe2\x80\x8e (5 C)\n\n\n\xe2\x96\xba American football by continent\xe2\x80\x8e (6 C)\n\n\n\xe2\x96\xba American football by country\xe2\x80\x8e (41 C, 1 P)\n\n*\n\xe2\x96\xba American football-related lists\xe2\x80\x8e (6 C, 16 P)\n\nA\n\xe2\x96\xba American football occupations\xe2\x80\x8e (2 C, 6 P)\n\nC\n\xe2\x96\xba American football competitions\xe2\x80\x8e (15 C, 13 P)\n\nE\n\xe2\x96\xba American football equipment\xe2\x80\x8e (16 P)\n\nH\n\xe2\x96\xba History of American football\xe2\x80\x8e (8 C, 14 P)\n\nI\n\xe2\x96\xba American football incidents\xe2\x80\x8e (1 C, 45 P)\n\nM\n\xe2\x96\xba American football media\xe2\x80\x8e (12 C, 16 P)\n\nO\n\xe2\x96\xba American football organisations\xe2\x80\x8e (1 C, 7 P)\n\nP\n\xe2\x96\xba American football people\xe2\x80\x8e (11 C)\n\n\n\xe2\x96\xba American football plays\xe2\x80\x8e (68 P)\n\n\n\xe2\x96\xba American football positions\xe2\x80\x8e (1 C, 41 P)\n\nR\n\xe2\x96\xba American football records and statistics\xe2\x80\x8e (4 C, 8 P)\n\nS\n\xe2\x96\xba Seasons in American football\xe2\x80\x8e (14 C)\n\n\n\xe2\x96\xba Semi-professional American football\xe2\x80\x8e (1 C, 9 P)\n\n\n\xe2\x96\xba American football strategy\xe2\x80\x8e (1 C, 29 P)\n\nT\n\xe2\x96\xba American football teams\xe2\x80\x8e (10 C, 10 P)\n\n\n\xe2\x96\xba American football terminology\xe2\x80\x8e (4 C, 127 P)\n\n\n\xe2\x96\xba American football trophies and awards\xe2\x80\x8e (9 C, 26 P)\n\nV\n\xe2\x96\xba Variations of American football\xe2\x80\x8e (5 C, 12 P)\n\n\n\xe2\x96\xba American football venues\xe2\x80\x8e (2 C, 2 P)\n\nW\n\xe2\x96\xba Women's American football\xe2\x80\x8e (3 C, 3 P)\n\n\xce\x99\n\xe2\x96\xba American football logos\xe2\x80\x8e (3 C, 211 F)\n\n\xce\xa3\n\xe2\x96\xba American football stubs\xe2\x80\x8e (6 C, 218 P)\n\n\n"
我试图找出除了实际文本名称之外的所有内容:即
\xe2\x80\x8e (6 C, 218 P)\n\n\n
有没有一个技巧可以使用美丽的汤库摆脱这个,或者我该如何进一步完善文本?
答案 0 :(得分:1)
导航至您想要的a
。
soup = bs4.BeautifulSoup(raw)
for cat in soup.findAll("a", {"class": "CategoryTreeLabel"}):
print(cat.text)
输出:
American football by city
American football by continent
American football by country
American football-related lists
American football occupations
American football competitions
American football equipment
History of American football
American football incidents
American football media
American football organisations
American football people
American football plays
American football positions
American football records and statistics
Seasons in American football
Semi-professional American football
American football strategy
American football teams
American football terminology
American football trophies and awards
Variations of American football
American football venues
Women's American football
American football logos
American football stubs