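My task is to automate printing Wikipedia infobox data. As an example, I am scraping the Star Trek Wikipedia page (https://en.wikipedia.org/wiki/Star_Trek), and I want to extract the infobox section on the right-hand side and print it row by row on screen using Python. I specifically want the infobox. So far I have this: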
from bs4 import BeautifulSoup
import urllib.request
# specify the url
urlpage = 'https://en.wikipedia.org/wiki/Star_Trek'
# query the website and return the html to the variable 'page'
page = urllib.request.urlopen(urlpage)
# parse the html using beautiful soup and store in variable 'soup'
soup = BeautifulSoup(page, 'html.parser')
# find results within table
table = soup.find('table', attrs={'class': 'infobox vevent'})
results = table.find_all('tr')
print(type(results))
print('Number of results', len(results))
print(results)
This gives me everything in the infobox, including the markup, as shown below:
[<tr><th class="summary" colspan="2" style="text-align:center;font-
size:125%;font-weight:bold;font-style: italic; background: lavender;">
<i>Star Trek</i></th></tr>, <tr><td colspan="2" style="text-align:center">
<a class="image" href="/wiki/File:Star_Trek_TOS_logo.svg"><img alt="Star
Trek TOS logo.svg" data-file-height="132" data-file-width="560" height="59"
I want to extract only the data and print it on screen. So what I want is:
Created by Gene Roddenberry
Original work Star Trek: The Original Series
Print publications
Book(s)
List of reference books
List of technical manuals
Novel(s) List of novels
Comics List of comics
Magazine(s)
Star Trek: The Magazine
Star Trek Magazine
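And so on, until the end of the infobox. So basically, is there a way to print every row of the infobox data so that I can automate this for any wiki page? (The infobox table on all wiki pages has the class 'infobox vevent', as shown in the code.)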
Answer 0 (score: 0)
This page should help you parse the HTML into a simple string without the HTML tags: Using BeautifulSoup Extract Text without Tags
Here is the code from that page; it belongs to @0605002.
>>> html = """
<p>
<strong class="offender">YOB:</strong> 1987<br />
<strong class="offender">RACE:</strong> WHITE<br />
<strong class="offender">GENDER:</strong> FEMALE<br />
<strong class="offender">HEIGHT:</strong> 5'05''<br />
<strong class="offender">WEIGHT:</strong> 118<br />
<strong class="offender">EYE COLOR:</strong> GREEN<br />
<strong class="offender">HAIR COLOR:</strong> BROWN<br />
</p>
"""
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'html.parser')
>>> print(soup.text)
YOB: 1987
RACE: WHITE
GENDER: FEMALE
HEIGHT: 5'05''
WEIGHT: 118
EYE COLOR: GREEN
HAIR COLOR: BROWN
Answer 1 (score: 0)
With BeautifulSoup, you need to reformat the data yourself as needed. Use fresult = [e.text for e in results] to get the text of each row in results.
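As a minimal sketch of that per-row text extraction, the snippet below walks the rows of an infobox table and joins the text of each header and data cell into one line. The sample markup is a hypothetical, trimmed-down stand-in for the real page, and the helper name infobox_rows is made up for illustration. It also assumes that matching on the 'infobox' class alone is acceptable, since not every page necessarily carries the extra 'vevent' class.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down infobox markup used only for illustration.
SAMPLE_HTML = """
<table class="infobox vevent">
  <tr><th colspan="2" class="summary"><i>Star Trek</i></th></tr>
  <tr><th>Created by</th><td><a href="/wiki/Gene_Roddenberry">Gene Roddenberry</a></td></tr>
  <tr><th>Original work</th><td><i>Star Trek: The Original Series</i></td></tr>
  <tr><th colspan="2">Print publications</th></tr>
  <tr><th>Novel(s)</th><td><a href="#">List of novels</a></td></tr>
</table>
"""


def infobox_rows(html):
    """Return one cleaned text line per non-empty infobox row (illustrative helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: match on the 'infobox' class only, so any additional
    # classes such as 'vevent' are ignored.
    table = soup.select_one("table.infobox")
    if table is None:
        return []
    lines = []
    for tr in table.find_all("tr"):
        # get_text(" ", strip=True) drops the tags and collapses each cell to plain text.
        cells = [cell.get_text(" ", strip=True) for cell in tr.find_all(["th", "td"])]
        line = " ".join(cell for cell in cells if cell)
        if line:  # skip rows that hold only images or empty cells
            lines.append(line)
    return lines


for line in infobox_rows(SAMPLE_HTML):
    print(line)
```

To run this against the live page, the html argument could be the page object returned by urllib.request.urlopen in the question. Real infoboxes contain nested lists and sometimes nested tables, so the output may need further cleanup.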
If you want to read the tables in the HTML, you could try something like the following code, although it uses pandas.
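import pandas
urlpage = 'https://en.wikipedia.org/wiki/Star_Trek'
# read_html returns a list of DataFrames, one per <table>; take the first one
data = pandas.read_html(urlpage)[0]
null = data.isnull()
for x in range(len(data)):
    first = data.iloc[x, 0]
    # print an empty string instead of NaN when the second cell is missing
    second = data.iloc[x, 1] if not null.iloc[x, 1] else ""
    print(first, second, "\n")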