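My goal is to extract the "Founded" and "Products" information from the infobox of the Wikipedia page for Microsoft. I am using Python 3 with the following code, which I found online, but it does not work: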
# importing modules
import requests
from lxml import etree
# manually storing desired URL
url='https://en.wikipedia.org/wiki/Microsoft'
# fetching its url through requests module
req = requests.get(url)
store = etree.fromstring(req.text)
# trying to get the 'Founded' portion of above
# URL's info box of Wikipedia's page
output = store.xpath('//table[@class="infoboxvcard"]/tr[th/text()="Founded"]/td/i')
# printing the text portion
print output[0].text
#Expected result:
Founded:April 4, 1975; 43 years ago in Albuquerque, New Mexico, U.S.
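Answer 0 (score: 2)
The XPath in the question was incorrect. I retrieved the correct XPath for that element from the Wikipedia page linked in the question. I also wrapped the print argument in parentheses for Python 3 compatibility.
Try: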
# importing modules
import requests
from lxml import etree
# manually storing desired URL
url='https://en.wikipedia.org/wiki/Microsoft'
# fetching its url through requests module
req = requests.get(url)
store = etree.fromstring(req.text)
# the original XPath was incorrect; this one was copied from the live page
output = store.xpath('//*[@id="mw-content-text"]/div/table[2]/tbody/tr[7]/td')
# parentheses added for Python 3
print(output[0].text)
I get:
April 4, 1975
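A positional XPath such as `table[2]/tbody/tr[7]` depends on the page layout at the time it was copied, so it may stop matching when the article is edited. As a hedged alternative, the sketch below selects the infobox row by its header text instead. It assumes the infobox is a `<table>` whose `class` attribute contains `infobox` and that each row pairs a `<th>` label with a `<td>` value; `SAMPLE_HTML` and the helper name `infobox_value` are made up for illustration and are not taken from the real page.

```python
from lxml import html

# Hypothetical, simplified stand-in for the rendered infobox markup.
SAMPLE_HTML = """
<html><body>
<table class="infobox vcard">
  <tbody>
    <tr><th scope="row">Founded</th><td>April 4, 1975</td></tr>
    <tr><th scope="row">Products</th>
        <td><ul><li>Windows</li><li>Office</li></ul></td></tr>
  </tbody>
</table>
</body></html>
"""


def infobox_value(page_html, label):
    """Return the text fragments of the infobox cell whose row header equals `label`."""
    # html.fromstring uses lxml's lenient HTML parser rather than the XML parser.
    tree = html.fromstring(page_html)
    cells = tree.xpath(
        # Match "infobox" as a whole class token, then find the row by header text.
        '//table[contains(concat(" ", normalize-space(@class), " "), " infobox ")]'
        '//tr[th[normalize-space()=$label]]/td',
        label=label,
    )
    if not cells:
        return []
    # Collect the non-empty text nodes inside the cell, in document order.
    return [t.strip() for t in cells[0].itertext() if t.strip()]


print(infobox_value(SAMPLE_HTML, "Founded"))
print(infobox_value(SAMPLE_HTML, "Products"))
```

Against the live page, the same helper could be called as `infobox_value(requests.get(url).text, "Founded")`; the real cell holds more markup than this sample, so the returned fragments would likely need further cleanup.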
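Answer 1 (score: 0)
You should probably use mwparserfromhell instead of trying to parse the MediaWiki markup yourself. With mwparserfromhell, you can filter out the templates and then extract their individual parameters.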
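import mwparserfromhell

# `text` must hold the page's raw wikitext, not the rendered HTML
code = mwparserfromhell.parse(text)
for template in code.filter_templates():
    if template.name.matches("infobox"):
        for p in template.params:
            ...  # process each parameter via p.name and p.value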