I'm using the Wikipedia API to fetch infobox data, and I want to parse the website URL out of that infobox data. I tried using mwparserfromhell to parse the website URL, but different keywords come in different formats.
Here are a few of the patterns the website field can take -
url = <!-- {{URL|www.example.com}} -->
| url = [https://www.TheGuardian.com/ TheGuardian.com]
| url = <span class="plainlinks">[https://www.naver.com/ www.naver.com]</span>
|url = [https://www.tmall.com/ tmall.com]
|url = [http://www.ustream.tv/ ustream.tv]
I need help parsing the official website link for all of the patterns that Wikipedia supports.
Edit
Code -
# get infobox data
import requests
# keyword
keyword = 'stackoverflow.com'
# wikipedia api url
api_url = (
    'https://en.wikipedia.org/w/api.php?action=query&prop=revisions&'
    'rvprop=content&titles=%s&rvsection=0&format=json' % keyword)
# api request
resp = requests.get(api_url).json()
page_one = next(iter(resp['query']['pages'].values()))
revisions = page_one.get('revisions', [])
# infobox data (the raw wikitext is under the '*' key of the revision)
infobox_data = revisions[0]['*']
# parse website url
import mwparserfromhell
wikicode = mwparserfromhell.parse(infobox_data)
templates = wikicode.filter_templates()
website_url_1 = ''
website_url_2 = ''
for template in templates:
    # Pattern - `URL|http://x.com`
    if template.name.matches("URL"):
        website_url_1 = str(template.get(1).value)
        break
    if not website_url_1:
        # Pattern - `website = http://x.com`
        try:
            website_url_2 = str(template.get("website").value)
        except ValueError:
            pass
    if not website_url_1:
        # Pattern - `homepage = http://x.com`
        try:
            website_url_2 = str(template.get("homepage").value)
        except ValueError:
            pass

if website_url_1:
    website_url = website_url_1
elif website_url_2:
    website_url = website_url_2
Answer 0 (score: 0)
The patterns you mention can be parsed with regular expressions and BeautifulSoup, and one could imagine extending this approach to cover other patterns.
I strip the leading 'url =' part from each line and hand the remainder to BeautifulSoup. Since BeautifulSoup wraps whatever it is given into a complete page, the original content can be retrieved as the text of the body element.
>>> import re
>>> patterns = '''\
... url = <!-- {{URL|www.example.com}} -->
... | url = [https://www.TheGuardian.com/ TheGuardian.com]
... | url = <span class="plainlinks">[https://www.naver.com/ www.naver.com]</span>
... |url = [https://www.tmall.com/ tmall.com]
... |url = [http://www.ustream.tv/ ustream.tv]'''
>>> import bs4
>>> regex = re.compile(r'\s*\|?\s*url\s*=\s*', re.I)
>>> for pattern in patterns.split('\n'):
...     soup = bs4.BeautifulSoup(re.sub(regex, '', pattern), 'lxml')
...     if str(soup).startswith('<!--'):
...         'just a comment'
...     else:
...         soup.find('body').getText()
...
'just a comment'
'[https://www.TheGuardian.com/ TheGuardian.com]'
'[https://www.naver.com/ www.naver.com]'
'[https://www.tmall.com/ tmall.com]'
'[http://www.ustream.tv/ ustream.tv]'
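The strings above still contain MediaWiki's external-link bracket syntax. As a minimal follow-up sketch, assuming every link has the form `[scheme://url optional label]`, one more regex pulls out the bare URL:
>>> link_re = re.compile(r'\[(https?://\S+)(?: [^\]]*)?\]')
>>> link_re.search('[https://www.tmall.com/ tmall.com]').group(1)
'https://www.tmall.com/'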
Answer 1 (score: 0)
mwparserfromhell is a good tool for this:
import mwclient
import mwparserfromhell

site = mwclient.Site('en.wikipedia.org')
pagename = 'Stack Overflow'  # example page title
text = site.pages[pagename].text()
wikicode = mwparserfromhell.parse(text)
templates = wikicode.filter_templates(matches='infobox .*')
url = templates[0].get('url').value
url_template = url.filter_templates(matches='url')
url_link = url.filter_external_links()
if url_template:
    print(url_template[0].get(1))
elif url_link:
    print(url_link[0].url)
else:
    print(url)
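This still misses the question's first pattern, where the {{URL}} template is hidden inside an HTML comment; filter_templates() does not descend into comments. A minimal sketch extending the snippet above (`url` is the parameter value from above), assuming each comment wraps at most one {{URL}} template:

# Handle e.g. `url = <!-- {{URL|www.example.com}} -->`: unwrap each HTML
# comment in the parameter value and re-parse its contents for a {{URL}} template.
for comment in url.filter_comments():
    inner = mwparserfromhell.parse(comment.contents)
    inner_templates = inner.filter_templates(matches='url')
    if inner_templates:
        print(inner_templates[0].get(1).value)
        break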
Answer 2 (score: 0)
I wrote this snippet, which might help:
import collections
import wikipedia
from bs4 import BeautifulSoup

def infobox(wiki_page):
    """Returns the infobox of a given wikipedia page as a dict of sections."""
    if isinstance(wiki_page, str):
        wiki_page = wikipedia.page(wiki_page)
    try:
        soup = BeautifulSoup(wiki_page.html(), 'html.parser').find_all(
            "table", {"class": "infobox"})[0]
    except IndexError:
        # the page has no infobox table
        return None
    ret = collections.defaultdict(dict)
    section = ""
    for tr in soup.find_all("tr"):
        th = tr.find_all("th")
        if not any(th):
            continue
        th = th[0]
        # a full-width header row starts a new infobox section
        if str(th.get("colspan")) == '2':
            section = th.text.translate({160: ' '}).strip()
            continue
        k = th.text.translate({160: ' '}).strip()
        try:
            v = tr.find_all("td")[0].text.translate({160: ' '}).strip()
            ret[section][k] = v
        except IndexError:
            continue
    return dict(ret)
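Example use, assuming the wikipedia and beautifulsoup4 packages are installed; the section and key names ('Website' here) vary from page to page:

data = infobox('Stack Overflow')
if data:
    for section, fields in data.items():
        if 'Website' in fields:
            print(fields['Website'])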