I have an HTML page like this; it is essentially the box on the right-hand side of the Wikipedia page for Microsoft:
<tbody>
<tr>
<td class="logo" colspan="2" style="text-align:center">
<a class="image" href="/wiki/File:Microsoft_logo_(2012).svg" title="A square divided into four sub-squares, colored red, green, yellow and blue (clockwise), with the company name appearing to its right."><img alt="A square divided into four sub-squares, colored red, green, yellow and blue (clockwise), with the company name appearing to its right." data-file-height="109" data-file-width="512" decoding="async" height="47" src="//upload.wikimedia.org/wikipedia/commons/thumb/9/96/Microsoft_logo_%282012%29.svg/220px-Microsoft_logo_%282012%29.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/9/96/Microsoft_logo_%282012%29.svg/330px-Microsoft_logo_%282012%29.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/9/96/Microsoft_logo_%282012%29.svg/440px-Microsoft_logo_%282012%29.svg.png 2x" width="220" /></a>
<div>Microsoft's logo since 2012</div>
</td>
</tr>
<tr>
<td class="logo" colspan="2" style="text-align:center">
<a class="image" href="/wiki/File:Building92microsoft.jpg"><img alt="Building92microsoft.jpg" data-file-height="3456" data-file-width="5184" decoding="async" height="147" src="//upload.wikimedia.org/wikipedia/commons/thumb/3/30/Building92microsoft.jpg/220px-Building92microsoft.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/3/30/Building92microsoft.jpg/330px-Building92microsoft.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/3/30/Building92microsoft.jpg/440px-Building92microsoft.jpg 2x" width="220" /></a>
<div>Building 92 on the <a href="/wiki/Microsoft_Redmond_campus" title="Microsoft Redmond campus">Microsoft Redmond campus</a> in <a href="/wiki/Redmond,_Washington" title="Redmond, Washington">Redmond, Washington</a></div>
</td>
</tr>
<tr>
<th scope="row" style="padding-right:0.5em;">
<div style="padding:0.1em 0;line-height:1.2em;"><a href="/wiki/List_of_legal_entity_types_by_country" title="List of legal entity types by country">Type</a></div>
</th>
<td class="category" style="line-height:1.35em;"><a href="/wiki/Public_company" title="Public company">Public</a></td>
</tr>
<tr>
<th scope="row" style="padding-right:0.5em;"><a href="/wiki/Ticker_symbol" title="Ticker symbol">Traded as</a></th>
<td style="line-height:1.35em;">
<div class="plainlist">
<ul>
<li><a href="/wiki/NASDAQ" title="NASDAQ">NASDAQ</a>: <a class="external text" href="https://www.nasdaq.com/symbol/msft" rel="nofollow">MSFT</a></li>
<li><a href="/wiki/NASDAQ-100" title="NASDAQ-100">NASDAQ-100</a> component</li>
<li><a href="/wiki/Dow_Jones_Industrial_Average" title="Dow Jones Industrial Average">DJIA</a> component</li>
<li><a href="/wiki/S%26P_100" title="S&P 100">S&P 100</a> component</li>
<li><a class="mw-redirect" href="/wiki/S%26P_500" title="S&P 500">S&P 500</a> component</li>
</ul>
</div>
</td>
</tr>
<tr>
<th scope="row" style="padding-right:0.5em;"><a href="/wiki/International_Securities_Identification_Number" title="International Securities Identification Number">ISIN</a></th>
<td style="line-height:1.35em;"><span class="plainlinks nourlexpansion"><a class="external text" href="https://tools.wmflabs.org/isin/?language=de&isin=US5949181045">US5949181045</a></span></td>
</tr>
<tr>
<th scope="row" style="padding-right:0.5em;">Industry</th>
<td class="category" style="line-height:1.35em;">
<div class="plainlist">
<ul>
<li><a class="mw-redirect" href="/wiki/Computer_software" title="Computer software">Computer software</a></li>
<li><a href="/wiki/Computer_hardware" title="Computer hardware">Computer hardware</a></li>
<li><a href="/wiki/Consumer_electronics" title="Consumer electronics">Consumer electronics</a></li>
<li><a href="/wiki/Social_networking_service" title="Social networking service">Social networking service</a></li>
<li><a href="/wiki/Cloud_computing" title="Cloud computing">Cloud computing</a></li>
<li><a href="/wiki/Video_game_industry" title="Video game industry">Video games</a></li>
<li><a href="/wiki/Internet" title="Internet">Internet</a></li>
<li><a href="/wiki/Corporate_venture_capital" title="Corporate venture capital">Corporate venture capital</a></li>
</ul>
</div>
</td>
</tr>
<tr>
<th scope="row" style="padding-right:0.5em;">Founded</th>
<td style="line-height:1.35em;">April 4, 1975<span class="noprint">; 44 years ago</span><span style="display:none"> (<span class="bday dtstart published updated">1975-04-04</span>)</span> in <a href="/wiki/Albuquerque,_New_Mexico" title="Albuquerque, New Mexico">Albuquerque, New Mexico</a>, U.S.</td>
</tr>
<tr>
<th scope="row" style="padding-right:0.5em;">Founders</th>
<td class="agent" style="line-height:1.35em;">
<div class="plainlist">
<ul>
<li><a href="/wiki/Bill_Gates" title="Bill Gates">Bill Gates</a></li>
<li><a href="/wiki/Paul_Allen" title="Paul Allen">Paul Allen</a></li>
</ul>
</div>
</td>
</tr>
<tr>
<th scope="row" style="padding-right:0.5em;">Headquarters</th>
<td class="label" style="line-height:1.35em;"><a href="/wiki/Microsoft_Redmond_campus" title="Microsoft Redmond campus">One Microsoft Way</a>,
<div class="locality" style="display:inline"><a href="/wiki/Redmond,_Washington" title="Redmond, Washington">Redmond</a>, <a href="/wiki/Washington_(state)" title="Washington (state)">Washington</a></div>,
<div class="country-name" style="display:inline">U.S.</div>
</td>
</tr>
<tr>
<th scope="row" style="padding-right:0.5em;">
<div style="padding:0.1em 0;line-height:1.2em;">Area served</div>
</th>
<td style="line-height:1.35em;">Worldwide</td>
</tr>
<tr>
<th scope="row" style="padding-right:0.5em;">
<div style="padding:0.1em 0;line-height:1.2em;">Key people</div>
</th>
<td class="agent" style="line-height:1.35em;">
<div class="plainlist">
<ul>
<li><a href="/wiki/John_W._Thompson" title="John W. Thompson">John W. Thompson</a>
<br/>(<a class="mw-redirect" href="/wiki/Chairman" title="Chairman">Chairman</a>)</li>
<li><a href="/wiki/Satya_Nadella" title="Satya Nadella">Satya Nadella</a>
<br/>(<a href="/wiki/Chief_executive_officer" title="Chief executive officer">CEO</a>)</li>
<li><a href="/wiki/Brad_Smith_(American_lawyer)" title="Brad Smith (American lawyer)">Brad Smith</a>
<br/>(<a href="/wiki/President_(corporate_title)" title="President (corporate title)">President</a>)</li>
<li>Bill Gates
<br/>(<a href="/wiki/Technical_advisor" title="Technical advisor">Technical Advisor</a>)</li>
</ul>
</div>
</td>
</tr>
<tr>
<th scope="row" style="padding-right:0.5em;">Products</th>
<td style="line-height:1.35em;">
<div class="hlist">
<ul>
<li><a href="/wiki/Microsoft_Windows" title="Microsoft Windows">Windows</a></li>
<li><a href="/wiki/Microsoft_Office" title="Microsoft Office">Office</a></li>
<li><a href="/wiki/Microsoft_Servers" title="Microsoft Servers">Servers</a></li>
<li><a href="/wiki/Skype" title="Skype">Skype</a></li>
<li><a href="/wiki/Microsoft_Visual_Studio" title="Microsoft Visual Studio">Visual Studio</a></li>
<li><a href="/wiki/Microsoft_Dynamics" title="Microsoft Dynamics">Dynamics</a></li>
<li><a href="/wiki/Xbox" title="Xbox">Xbox</a></li>
<li><a href="/wiki/Microsoft_Surface" title="Microsoft Surface">Surface</a></li>
<li><a href="/wiki/Microsoft_Mobile" title="Microsoft Mobile">Mobile</a></li>
<li><a href="/wiki/List_of_Microsoft_software" title="List of Microsoft software">List of software</a></li>
</ul>
</div>
</td>
</tr>
<tr>
<th scope="row" style="padding-right:0.5em;">Services</th>
<td class="category" style="line-height:1.35em;">
<div class="hlist">
<ul>
<li><a href="/wiki/Microsoft_Azure" title="Microsoft Azure">Azure</a></li>
<li><a href="/wiki/Bing_(search_engine)" title="Bing (search engine)">Bing</a></li>
<li><a href="/wiki/LinkedIn" title="LinkedIn">LinkedIn</a></li>
<li><a href="/wiki/Microsoft_Developer_Network" title="Microsoft Developer Network">MSDN</a></li>
<li><a href="/wiki/Office_365" title="Office 365">Office 365</a></li>
<li><a href="/wiki/OneDrive" title="OneDrive">OneDrive</a></li>
<li><a href="/wiki/Outlook.com" title="Outlook.com">Outlook.com</a></li>
<li><a href="/wiki/Microsoft_TechNet" title="Microsoft TechNet">TechNet</a></li>
<li><a href="/wiki/Microsoft_Pay" title="Microsoft Pay">Pay</a></li>
<li><a href="/wiki/Microsoft_Store_(digital)" title="Microsoft Store (digital)">Microsoft Store</a></li>
<li><a href="/wiki/Windows_Update" title="Windows Update">Windows Update</a></li>
<li><a href="/wiki/Xbox_Live" title="Xbox Live">Xbox Live</a></li>
</ul>
</div>
</td>
</tr>
<tr>
<th scope="row" style="padding-right:0.5em;">Revenue</th>
<td style="line-height:1.35em;"><img alt="Increase" data-file-height="300" data-file-width="300" decoding="async" height="11" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/11px-Increase2.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/17px-Increase2.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/22px-Increase2.svg.png 2x" title="Increase" width="11" /> <span style="white-space: nowrap"><a href="/wiki/United_States_dollar" title="United States dollar">US$</a>125.8 billion</span><sup class="reference" id="cite_ref-ER-FY19_1-0"><a href="#cite_note-ER-FY19-1">[1]</a></sup> (2019)</td>
</tr>
<tr>
<th scope="row" style="padding-right:0.5em;">
<div style="padding:0.1em 0;line-height:1.2em;"><a href="/wiki/Earnings_before_interest_and_taxes" title="Earnings before interest and taxes">Operating income</a></div>
</th>
<td style="line-height:1.35em;"><img alt="Increase" data-file-height="300" data-file-width="300" decoding="async" height="11" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/11px-Increase2.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/17px-Increase2.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/22px-Increase2.svg.png 2x" title="Increase" width="11" /> <span style="white-space: nowrap">US$43.0 billion</span><sup class="reference" id="cite_ref-ER-FY19_1-1"><a href="#cite_note-ER-FY19-1">[1]</a></sup> (2019)</td>
</tr>
<tr>
<th scope="row" style="padding-right:0.5em;">
<div style="padding:0.1em 0;line-height:1.2em;"><a href="/wiki/Net_income" title="Net income">Net income</a></div>
</th>
<td style="line-height:1.35em;"><img alt="Increase" data-file-height="300" data-file-width="300" decoding="async" height="11" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/11px-Increase2.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/17px-Increase2.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/22px-Increase2.svg.png 2x" title="Increase" width="11" /> <span style="white-space: nowrap">US$39.2 billion</span><sup class="reference" id="cite_ref-ER-FY19_1-2"><a href="#cite_note-ER-FY19-1">[1]</a></sup> (2019)</td>
</tr>
<tr>
<th scope="row" style="padding-right:0.5em;"><span class="nowrap"><a href="/wiki/Asset" title="Asset">Total assets</a></span></th>
<td style="line-height:1.35em;"><img alt="Increase" data-file-height="300" data-file-width="300" decoding="async" height="11" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/11px-Increase2.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/17px-Increase2.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/22px-Increase2.svg.png 2x" title="Increase" width="11" /> <span style="white-space: nowrap">US$286.55 billion</span><sup class="reference" id="cite_ref-ER-FY19_1-3"><a href="#cite_note-ER-FY19-1">[1]</a></sup> (2019)</td>
</tr>
<tr>
<th scope="row" style="padding-right:0.5em;"><span class="nowrap"><a href="/wiki/Equity_(finance)" title="Equity (finance)">Total equity</a></span></th>
<td style="line-height:1.35em;"><img alt="Increase" data-file-height="300" data-file-width="300" decoding="async" height="11" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/11px-Increase2.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/17px-Increase2.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/22px-Increase2.svg.png 2x" title="Increase" width="11" /> <span style="white-space: nowrap">US$102.33 billion</span><sup class="reference" id="cite_ref-ER-FY19_1-4"><a href="#cite_note-ER-FY19-1">[1]</a></sup> (2019)</td>
</tr>
<tr>
<th scope="row" style="padding-right:0.5em;">
<div style="padding:0.1em 0;line-height:1.2em;">Number of employees</div>
</th>
<td style="line-height:1.35em;"><img alt="Increase" data-file-height="300" data-file-width="300" decoding="async" height="11" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/11px-Increase2.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/17px-Increase2.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/22px-Increase2.svg.png 2x" title="Increase" width="11" /> 144,106<sup class="reference" id="cite_ref-2"><a href="#cite_note-2">[2]</a></sup> (2019)</td>
</tr>
<tr>
<th scope="row" style="padding-right:0.5em;"><a href="/wiki/Subsidiary" title="Subsidiary">Subsidiaries</a></th>
<td style="line-height:1.35em;"><a href="/wiki/List_of_mergers_and_acquisitions_by_Microsoft" title="List of mergers and acquisitions by Microsoft">List of Microsoft assets</a></td>
</tr>
<tr>
<th scope="row" style="padding-right:0.5em;">Website</th>
<td style="line-height:1.35em;"><span class="url"><a class="external text" href="https://www.microsoft.com/" rel="nofollow">microsoft.com</a></span></td>
</tr>
</tbody>
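How can I turn this HTML code into a table?
I first tried pandas read_html, but it failed. Then I used BeautifulSoup, but the markup contains many tags, and other wiki pages sometimes use tags that differ from the ones on the Microsoft page. Put simply, I want to extract the inner text of the tags. How can I do this in Python, allowing for the possibility of further, different tags?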
Answer 0 (score: 2)
It uses BeautifulSoup to find the first table and reads the th and td in every row.
Some td cells contain li items, which need a nested for loop.
Code:
# https://2.python-requests.org/en/master/
# https://www.crummy.com/software/BeautifulSoup/bs4/doc/
import requests
from bs4 import BeautifulSoup as BS

url = 'https://en.wikipedia.org/wiki/Microsoft'
r = requests.get(url)
soup = BS(r.text, 'html.parser')

# The infobox is assumed to be the first table on the page.
all_tables = soup.find_all('table')
all_rows = all_tables[0].find_all('tr')

for row in all_rows:
    th = row.find('th')
    td = row.find('td')
    # Skip rows that lack a header or a value cell (e.g. the logo rows).
    if th is None or td is None:
        continue
    title = th.text
    all_li = td.find_all('li')
    if all_li:
        # The cell holds a list: print one line per item.
        for item in all_li:
            print(title, '>', item.get_text())
    else:
        print(title, '>', td.get_text())
Some rows still need separate cleaning. No single rule covers all of them, so each will need its own code.
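As a rough illustration of that per-row cleaning, here is a minimal sketch. It is not part of the original answer: the clean_cell helper and the inline SAMPLE markup (two abridged rows from the question) are hypothetical, and it assumes that citation markers sit in sup.reference elements and that hidden text sits in span elements carrying class noprint or a display:none style, as in the HTML shown in the question. Other wiki pages may need different selectors.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two rows copied (abridged) from the infobox in the question.
SAMPLE = """
<table>
<tr><th scope="row">Founded</th>
<td>April 4, 1975<span class="noprint">; 44 years ago</span><span style="display:none"> (<span class="bday">1975-04-04</span>)</span> in <a href="/wiki/Albuquerque,_New_Mexico">Albuquerque, New Mexico</a>, U.S.</td></tr>
<tr><th scope="row">Founders</th>
<td><div class="plainlist"><ul><li><a href="/wiki/Bill_Gates">Bill Gates</a></li><li><a href="/wiki/Paul_Allen">Paul Allen</a></li></ul></div></td></tr>
</table>
"""


def clean_cell(td):
    """Return a list of cleaned text values for one td (hypothetical helper)."""
    # Drop citation markers, print-hidden notes, and display:none spans.
    for junk in td.select("sup.reference, span.noprint, span[style*='display:none']"):
        junk.decompose()
    # A cell with li items yields one value per item; otherwise one value in total.
    items = td.find_all("li")
    sources = items if items else [td]
    # Collapse all runs of whitespace (including newlines) to single spaces.
    return [" ".join(node.get_text().split()) for node in sources]


soup = BeautifulSoup(SAMPLE, "html.parser")
table = {}
for row in soup.find_all("tr"):
    th, td = row.find("th"), row.find("td")
    if th is None or td is None:
        continue
    table[th.get_text(strip=True)] = clean_cell(td)

print(table)
```

Keeping every value as a list means single-value rows and list rows share one shape, which makes it simpler to hand the dictionary to something like pandas or csv afterwards.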
Answer 1 (score: 1)
Here is another way to get the same result. It still needs some cleanup, though.
import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/Microsoft"
res = requests.get(URL).text
soup = BeautifulSoup(res, 'lxml')  # the 'lxml' parser requires the lxml package

for items in soup.find('table', class_='vcard').find_all('tr'):
    # Remove citation links such as [1] before reading the text.
    for cite in items.select("a[href^='#cite']"):
        cite.extract()
    data = items.find_all(['th', 'td'])
    # Skip rows without both a header and a value cell.
    if len(data) < 2:
        continue
    title = data[0].text
    product = ' '.join(' '.join(item.split()) for item in data[1].strings).strip()
    print("{} | {}".format(title, product))
Output:
Type | Public
Traded as | NASDAQ : MSFT NASDAQ-100 component DJIA component S&P 100 component S&P 500 component
ISIN | US5949181045
Industry | Computer software Computer hardware Consumer electronics Social networking service Cloud computing Video games Internet Corporate venture capital
Founded | April 4, 1975 ; 44 years ago ( 1975-04-04 ) in Albuquerque, New Mexico , U.S.
Founders | Bill Gates Paul Allen
Headquarters | One Microsoft Way , Redmond , Washington , U.S.
Area served | Worldwide
Key people | John W. Thompson ( Chairman ) Satya Nadella ( CEO ) Brad Smith ( President ) Bill Gates ( Technical Advisor )
Products | Windows Office Servers Skype Visual Studio Dynamics Xbox Surface Mobile List of software
Services | Azure Bing LinkedIn MSDN Office 365 OneDrive Outlook.com TechNet Pay Microsoft Store Windows Update Xbox Live
Revenue | US$ 125.8 billion (2019)
Operating income | US$43.0 billion (2019)
Net income | US$39.2 billion (2019)
Total assets | US$286.55 billion (2019)
Total equity | US$102.33 billion (2019)
Number of employees | 144,106 (2019)
Subsidiaries | List of Microsoft assets
Website | microsoft.com
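The cleanup this answer mentions could start with a pass over the stray spaces that joining the text fragments leaves around punctuation. The sketch below is one possible approach, not the answerer's code; the tidy function is a hypothetical helper that uses only the standard re module, and it deliberately leaves cases such as "US$ 125.8 billion" alone.

```python
import re


def tidy(value: str) -> str:
    """Collapse whitespace and remove stray spaces around punctuation (hypothetical helper)."""
    value = " ".join(value.split())
    # Remove spaces before closing punctuation: "Way , Redmond" -> "Way, Redmond"
    value = re.sub(r"\s+([,;:)])", r"\1", value)
    # Remove spaces after an opening parenthesis: "( CEO" -> "(CEO"
    value = re.sub(r"\(\s+", "(", value)
    return value


print(tidy("One Microsoft Way , Redmond , Washington , U.S."))
print(tidy("John W. Thompson ( Chairman ) Satya Nadella ( CEO )"))
```

It could be applied to product just before the print call in the loop above.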