Parsing a table horizontally

Time: 2018-08-25 16:00:10

Tags: python python-3.x pandas

I am trying to parse this table from an HTML file, but the output is printed vertically, one cell per line.

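import pandas as pd
from bs4 import BeautifulSoup

# Read the saved page, then parse it with the lxml parser
with open("htmltabletest.html", encoding="utf-8") as f:
    data = f.read()

soup = BeautifulSoup(data, 'lxml')
trs = soup.find_all('tr')
for row in trs:
    # Each cell is printed with its own print() call, hence one cell per line
    for nn in row.find_all('td'):
        print(nn.text)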

Contents of htmltabletest.html:
<table class="dataTable st-alternateRows" id="eventSearchTable">
<thead>
<tr>
<th id="th-es-rb"><div class="dt-th"> </div></th>
<th id="th-es-ed"><div class="dt-th"><span class="th-divider"> </span>Event date<br/>Time (local)</div></th>
<th id="th-es-en"><div class="dt-th"><span class="th-divider"> </span>Event name<br/>Venue</div></th>
<th id="th-es-ti"><div class="dt-th"><span class="th-divider"> </span>Tickets<br/>listed</div></th>
<th id="th-es-pr"><div class="dt-th es-lastCell"><span class="th-divider"> </span>Price<br/>range</div></th>
</tr>
</thead>
<tbody class="" id="eventSearchTbody"><tr class="even" id="r-se-103577924">
<td class="nowrap"><input class="es-selectedEvent" id="se-103577924-check" name="selectEvent" type="radio"/></td>
<td class="nowrap" id="se-103577924-eventDateTime">Thu, 10/11/2018<br/>8:20 p.m.</td>
<td><div><a class="ellip" href="services/priceanalysis?eventId=103577924&amp;sectionId=0" id="se-103577924-eventName" target="_blank">Philadelphia Eagles at New York Giants</a></div><div id="se-103577924-venue">MetLife Stadium, East Rutherford, NJ</div></td>
<td id="se-103577924-nrTickets">6655</td>
<td class="es-lastCell nowrap" id="se-103577924-priceRange"><span id="se-103577924-minPrice">$134.50</span>  to<br/><span id="se-103577924-maxPrice">$2,222.50</span></td>
</tr><tr class="odd" id="r-se-103577925">
<td class="nowrap"><input class="es-selectedEvent" id="se-103577925-check" name="selectEvent" type="radio"/></td>
<td class="nowrap" id="se-103577925-eventDateTime">Thu, 10/11/2018<br/>8:21 p.m.</td>
<td><div><a class="ellip" href="services/priceanalysis?eventId=103577925&amp;sectionId=0" id="se-103577925-eventName" target="_blank">PARKING PASSES ONLY Philadelphia Eagles at New York Giants</a></div><div id="se-103577925-venue">MetLife Stadium Parking Lots, East Rutherford, NJ</div></td>
<td id="se-103577925-nrTickets">929</td>
<td class="es-lastCell nowrap" id="se-103577925-priceRange"><span id="se-103577925-minPrice">$20.39</span>  to<br/><span id="se-103577925-maxPrice">$3,602.50</span></td>
</tr></tbody>
</table>

When I run the script, the output looks like this:

Thu, 10/11/20188:20 p.m.
Philadelphia Eagles at New York GiantsMetLife Stadium, East Rutherford, NJ
6655
$134.50  to$2,222.50

Thu, 10/11/20188:21 p.m.
PARKING PASSES ONLY Philadelphia Eagles at New York GiantsMetLife Stadium Parking Lots, East Rutherford, NJ
929
$20.39  to$3,602.50

I am trying to get output that looks like this:

 Event date                                                             Event name                     Tickets       Price 
Time (local)                                                            Venue                          Listed        Range


Thu, 10/11/20188:20 p.m.   Philadelphia Eagles at New York GiantsMetLife Stadium, East Rutherford, NJ   6655  $134.50  to$2,222.50

Thu, 10/11/20188:21 p.m.   PARKING PASSES ONLY Philadelphia Eagles at New York GiantsMetLife Stadium Parking Lots, East Rutherford, NJ   929   $20.39  to$3,602.50

I have been working on this all day and still cannot get the desired output, so any help would be appreciated!

1 Answer:

Answer 0: (score: 2)

pd.read_html does a fairly good job on this table:

In [17]: pd.read_html(s)[0].iloc[:, 1:]
Out[17]:
     Event dateTime (local)                                    Event nameVenue  Ticketslisted           Pricerange
0  Thu, 10/11/20188:20 p.m.  Philadelphia Eagles at New York GiantsMetLife ...           6655  $134.50 to$2,222.50
1  Thu, 10/11/20188:21 p.m.  PARKING PASSES ONLY Philadelphia Eagles at New...            929   $20.39 to$3,602.50
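
The session above does not show how `s` was created; presumably it holds the HTML text of the table. A minimal sketch of one way to reproduce it follows, assuming lxml is installed (as in the question). The `html` string is a trimmed copy of the table with most attributes dropped, not the original file. Recent pandas releases deprecate passing a literal HTML string to `read_html`, so the text is wrapped in `StringIO` here.

```python
from io import StringIO

import pandas as pd

# Trimmed, illustrative copy of the question's table (assumption: the real
# file could instead be read with open("htmltabletest.html", encoding="utf-8")).
html = """<table id="eventSearchTable">
<thead><tr>
<th> </th>
<th>Event date<br/>Time (local)</th>
<th>Event name<br/>Venue</th>
<th>Tickets<br/>listed</th>
<th>Price<br/>range</th>
</tr></thead>
<tbody>
<tr>
<td><input type="radio"/></td>
<td>Thu, 10/11/2018<br/>8:20 p.m.</td>
<td><div><a>Philadelphia Eagles at New York Giants</a></div><div>MetLife Stadium, East Rutherford, NJ</div></td>
<td>6655</td>
<td><span>$134.50</span>  to<br/><span>$2,222.50</span></td>
</tr>
<tr>
<td><input type="radio"/></td>
<td>Thu, 10/11/2018<br/>8:21 p.m.</td>
<td><div><a>PARKING PASSES ONLY Philadelphia Eagles at New York Giants</a></div><div>MetLife Stadium Parking Lots, East Rutherford, NJ</div></td>
<td>929</td>
<td><span>$20.39</span>  to<br/><span>$3,602.50</span></td>
</tr>
</tbody>
</table>"""

# read_html returns a list of DataFrames, one per <table> found
tables = pd.read_html(StringIO(html))

# Drop the first column, which only holds the radio buttons
df = tables[0].iloc[:, 1:]
print(df)
```

Each table row becomes one DataFrame row, which is the horizontal layout the question asks for.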

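It looks like some extra cleanup may be needed afterwards (text on either side of a `<br/>` or in adjacent `<div>`s is run together), but this should at least give you a good starting point.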
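
One possible cleanup, sketched with BeautifulSoup instead of `pd.read_html`: `get_text` accepts a separator, so the pieces of a cell can be joined with a space rather than run together. The `html` string is a trimmed, illustrative copy of the table, `cell_text` is a hypothetical helper name, and `html.parser` is used only to avoid an extra dependency; none of this comes from the original answer.

```python
from bs4 import BeautifulSoup

# Trimmed, illustrative copy of one row of the question's table (assumption)
html = """<table id="eventSearchTable">
<thead><tr>
<th><div class="dt-th"> </div></th>
<th><div class="dt-th"><span class="th-divider"> </span>Event date<br/>Time (local)</div></th>
<th><div class="dt-th"><span class="th-divider"> </span>Event name<br/>Venue</div></th>
<th><div class="dt-th"><span class="th-divider"> </span>Tickets<br/>listed</div></th>
<th><div class="dt-th"><span class="th-divider"> </span>Price<br/>range</div></th>
</tr></thead>
<tbody><tr>
<td><input type="radio"/></td>
<td>Thu, 10/11/2018<br/>8:20 p.m.</td>
<td><div><a>Philadelphia Eagles at New York Giants</a></div><div>MetLife Stadium, East Rutherford, NJ</div></td>
<td>6655</td>
<td><span>$134.50</span>  to<br/><span>$2,222.50</span></td>
</tr></tbody>
</table>"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="eventSearchTable")


def cell_text(cell):
    # Strip each text fragment and join the non-empty ones with a space,
    # so "<br/>" and adjacent <div>s no longer glue words together.
    return cell.get_text(" ", strip=True)


# [1:] skips the first column, which only holds the radio buttons
header = [cell_text(th) for th in table.thead.find_all("th")][1:]
rows = [
    [cell_text(td) for td in tr.find_all("td")][1:]
    for tr in table.tbody.find_all("tr")
]

# Print one table row per output line
print(" | ".join(header))
for row in rows:
    print(" | ".join(row))
```

If a DataFrame is preferred, `header` and `rows` can be passed to `pd.DataFrame(rows, columns=header)`.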