I need Python to extract some data from an HTML file.
The code I am currently using is:
import urllib
recent = urllib.urlopen("http://gamebattles.majorleaguegaming.com/ps4/call-of-duty-ghosts/team/TeamCrYpToNGamingEU/match?id=46057240")
recentsource = recent.read()
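I now need it to take this page source and print a list of the gamertags in the table on that page for the other team.
How can I do this?
Thanks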
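Answer 0 (score: 2)
Take a look at the Beautiful Soup module, which is a great HTML parser.
If you don't want to install it, or can't, you can download the source code and put the .py files in the same directory as your program.
To do this, download and extract the code from the website, then copy the "bs4" directory into the same folder as your Python script.
Then put this at the top of your code: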
from bs4 import BeautifulSoup
# or
from bs4 import BeautifulSoup as bs
# To type bs instead of BeautifulSoup every single time you use it
You can learn how to use it from other Stack Overflow questions, or check the documentation.
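As a rough illustration of how Beautiful Soup could be applied to the question's goal, here is a minimal Python 3 sketch. The markup in `sample_html` is made up for the example, and the helper name `extract_gamertags` and the assumption that the gamertag sits in the fourth column are hypothetical; the real page would need to be inspected to find the right table and column.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the downloaded page source (assumption:
# the real page has a similar table, which has not been verified here).
sample_html = """
<h3>Example Team Match Players</h3>
<table>
  <tr><th>#</th><th>Username</th><th>Role</th><th>Gamertag</th></tr>
  <tr><td>1</td><td>PlayerOne</td><td>Leader</td><td>TagOne</td></tr>
  <tr><td>2</td><td>PlayerTwo</td><td>Member</td><td>TagTwo</td></tr>
</table>
"""


def extract_gamertags(html, column=3):
    """Return the text of one column from every data row (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")  # stdlib-backed parser
    tags = []
    for row in soup.find_all("tr"):
        cells = row.find_all("td")  # header rows use <th>, so they yield no cells
        if len(cells) > column:
            tags.append(cells[column].get_text(strip=True))
    return tags


gamertags = extract_gamertags(sample_html)
print(gamertags)
```

With the real page, `sample_html` would be replaced by the downloaded source (`recentsource` in the question), and the column index adjusted to match the actual table.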
Answer 1 (score: 0)
You can use html2text for this, or you can use nltk.
Example code:
import nltk
from urllib import urlopen
url = "http://any-url"
html = urlopen(url).read()
# Note: clean_html() is only available in NLTK 2.x; NLTK 3 raises NotImplementedError for it
raw = nltk.clean_html(html)
print(raw)
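Since `nltk.clean_html` is not implemented in NLTK 3, here is a rough sketch of the same tag-stripping idea using only the Python 3 standard library. This swaps in `html.parser.HTMLParser` in place of NLTK; the class name `TextExtractor`, the function `strip_tags`, and the sample markup are hypothetical names chosen for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes and ignore tags (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not script or style content.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def strip_tags(html):
    """Return the visible text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample_html = (
    "<html><body><h3>Team A</h3><p>Player &amp; role</p>"
    "<script>var x = 1;</script></body></html>"
)
text = strip_tags(sample_html)
print(text)
```

Like `clean_html`, this only yields plain text; it does not keep the table structure, so it is better suited to reading a page than to picking out a specific column.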
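Answer 2 (score: 0)
pyparsing has some helpful constructs for extracting data from HTML pages, and the results tend to be self-structuring and self-naming (if you set up the parser/scanner correctly). Here is a pyparsing solution for this particular web page: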
from pyparsing import *
# for stripping HTML tags
anyTag,anyClose = makeHTMLTags(Word(alphas,alphanums+":_"))
commonHTMLEntity.setParseAction(replaceHTMLEntity)
stripHTML = lambda tokens: (commonHTMLEntity | Suppress(anyTag | anyClose) ).transformString(''.join(tokens))
# make pyparsing expressions for HTML opening and closing tags
# (suppress all from results, as there is no interesting content in the tags or their attributes)
h3,h3End = map(Suppress,makeHTMLTags("h3"))
table,tableEnd = map(Suppress,makeHTMLTags("table"))
tr,trEnd = map(Suppress,makeHTMLTags("tr"))
th,thEnd = map(Suppress,makeHTMLTags("th"))
td,tdEnd = map(Suppress,makeHTMLTags("td"))
# nothing interesting in column headings - parse them, but suppress the results
colHeading = Suppress(th + SkipTo(thEnd) + thEnd)
# simple routine for defining data cells, with optional results name
colData = lambda name='' : td + SkipTo(tdEnd)(name) + tdEnd
playerListing = Group(tr + colData() + colData() +
                      colData("username") +
                      colData().setParseAction(stripHTML)("role") +
                      colData("networkID") +
                      trEnd)
teamListing = (h3 + ungroup(SkipTo("Match Players" + h3End, failOn=h3))("name") + "Match Players" + h3End +
               table + tr + colHeading*5 + trEnd +
               Group(OneOrMore(playerListing))("players"))
for team in teamListing.searchString(recentsource):
    # use this to print out names and structures of results
    #print team.dump()
    print "Team:", team.name
    for player in team.players:
        print "- %s: %s (%s)" % (player.role, player.username, player.networkID)
        # or like this
        # print "- %(role)s: %(username)s (%(networkID)s)" % player
    print
Prints:
Team: Team CrYpToN Gaming EU
- Leader: CrYpToN_Crossy (CrYpToN_Crossy)
- Captain: Juddanorty (CrYpToN_Judd)
- Member: BLaZe_Elfy (CrYpToN_Elfy)
Team: eXCeL™
- Leader: Caaahil (Caaahil)
- Member: eSportsmanship (eSportsmanship)
- Member: KillBoy-NL (iClown-x)
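For readers new to pyparsing, the core pieces of the solution above (`makeHTMLTags`, `SkipTo`, and results names) can be tried in isolation on a small input. The following is only a sketch written for Python 3: the table in `sample_html` is made up, and the helper `cell` and the column order are assumptions for illustration, not the layout of the real page.

```python
from pyparsing import SkipTo, Suppress, makeHTMLTags

# Made-up rows for illustration (not taken from the real page).
sample_html = """
<table>
<tr><td>1</td><td>PlayerOne</td><td>Leader</td></tr>
<tr><td>2</td><td>PlayerTwo</td><td>Member</td></tr>
</table>
"""

# Opening and closing tag expressions, suppressed from the results.
tr, tr_end = map(Suppress, makeHTMLTags("tr"))
td, td_end = map(Suppress, makeHTMLTags("td"))


def cell(name=None):
    """Match one <td>...</td> cell, optionally naming its text (hypothetical helper)."""
    body = SkipTo(td_end)
    if name:
        body = body(name)  # calling an expression attaches a results name
    return td + body + td_end


row = tr + cell() + cell("username") + cell("role") + tr_end

# searchString scans the whole text and yields one result per matching row.
players = [(m.username, m.role) for m in row.searchString(sample_html)]
for username, role in players:
    print("- %s: %s" % (role, username))
```

The named fields are what make the results "self-naming": each match exposes `username` and `role` as attributes, so no positional indexing into the row is needed.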