Pandas outputs a strangely formatted DataFrame

Time: 2018-08-25 21:26:17

Tags: python python-3.x pandas

When I run this, it prints an odd-looking DataFrame that appears to be missing columns, even though I can see those columns in the HTML file.

import pandas as pd
from bs4 import BeautifulSoup
import lxml.html as lh

with open("htmltabletest.html", encoding="utf-8") as f:
    data = f.read()
    soup = BeautifulSoup(data, 'lxml')

    dfs = pd.read_html(soup.prettify())
    for df in dfs:
        print(df)

This outputs:

   Unnamed: 0           ...                      Price  range
0         NaN           ...            $134.50  to  $2,222.50
1         NaN           ...             $20.39  to  $3,602.50

[2 rows x 5 columns]

when htmltabletest.html contains:

<table class="dataTable st-alternateRows" id="eventSearchTable">
<thead>
<tr>
<th id="th-es-rb"><div class="dt-th"> </div></th>
<th id="th-es-ed"><div class="dt-th"><span class="th-divider"> </span>Event date<br/>Time (local)</div></th>
<th id="th-es-en"><div class="dt-th"><span class="th-divider"> </span>Event name<br/>Venue</div></th>
<th id="th-es-ti"><div class="dt-th"><span class="th-divider"> </span>Tickets<br/>listed</div></th>
<th id="th-es-pr"><div class="dt-th es-lastCell"><span class="th-divider"> </span>Price<br/>range</div></th>
</tr>
</thead>
<tbody class="" id="eventSearchTbody"><tr class="even" id="r-se-103577924">
<td class="nowrap"><input class="es-selectedEvent" id="se-103577924-check" name="selectEvent" type="radio"/></td>
<td class="nowrap" id="se-103577924-eventDateTime">Thu, 10/11/2018<br/>8:20 p.m.</td>
<td><div><a class="ellip" href="services/priceanalysis?eventId=103577924&amp;sectionId=0" id="se-103577924-eventName" target="_blank">Philadelphia Eagles at New York Giants</a></div><div id="se-103577924-venue">MetLife Stadium, East Rutherford, NJ</div></td>
<td id="se-103577924-nrTickets">6655</td>
<td class="es-lastCell nowrap" id="se-103577924-priceRange"><span id="se-103577924-minPrice">$134.50</span>  to<br/><span id="se-103577924-maxPrice">$2,222.50</span></td>
</tr><tr class="odd" id="r-se-103577925">
<td class="nowrap"><input class="es-selectedEvent" id="se-103577925-check" name="selectEvent" type="radio"/></td>
<td class="nowrap" id="se-103577925-eventDateTime">Thu, 10/11/2018<br/>8:21 p.m.</td>
<td><div><a class="ellip" href="services/priceanalysis?eventId=103577925&amp;sectionId=0" id="se-103577925-eventName" target="_blank">PARKING PASSES ONLY Philadelphia Eagles at New York Giants</a></div><div id="se-103577925-venue">MetLife Stadium Parking Lots, East Rutherford, NJ</div></td>
<td id="se-103577925-nrTickets">929</td>
<td class="es-lastCell nowrap" id="se-103577925-priceRange"><span id="se-103577925-minPrice">$20.39</span>  to<br/><span id="se-103577925-maxPrice">$3,602.50</span></td>
</tr></tbody>
</table>

3 Answers:

Answer 0: (score: 0)

I ran your code and it printed fine. You could also try display(df) if you are working in IPython or Jupyter.

Answer 1: (score: 0)

<tr>
<th id="th-es-rb"><div class="dt-th"> </div></th>
<th id="th-es-ed"><div class="dt-th"><span class="th-divider"> </span>Event date<br/>Time (local)</div></th>
<th id="th-es-en"><div class="dt-th"><span class="th-divider"> </span>Event name<br/>Venue</div></th>
<th id="th-es-ti"><div class="dt-th"><span class="th-divider"> </span>Tickets<br/>listed</div></th>
<th id="th-es-pr"><div class="dt-th es-lastCell"><span class="th-divider"> </span>Price<br/>range</div></th>
</tr>

Your program works correctly. Note the following header cell:

<th id="th-es-rb"><div class="dt-th"> </div></th>

It contains no text, so pandas has no name to give that column. If you change the input to, e.g.:

<th id="th-es-rb"><div class="dt-th"> My new column </div></th>

the column gets a proper name and the output looks as expected.

My output:

In [146]: df.columns

Out[146]: 
Index(['My new cole', 'Event date  Time (local)', 'Event name  Venue',
       'Tickets  listed', 'Price  range'],
      dtype='object')

In [145]: df

Out[145]: 
   My new cole    Event date  Time (local)  \
0          NaN  Thu, 10/11/2018  8:20 p.m.   
1          NaN  Thu, 10/11/2018  8:21 p.m.   
                                   Event name  Venue  Tickets  listed  \
0  Philadelphia Eagles at New York Giants  MetLif...             6655   
1  PARKING PASSES ONLY Philadelphia Eagles at New...              929   
             Price  range  
0  $134.50  to  $2,222.50  
1   $20.39  to  $3,602.50  
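If editing the HTML is not an option, the placeholder column could instead be handled after parsing. The following is only a minimal sketch: the small DataFrame is a hypothetical stand-in for what pd.read_html returned in the question, and the name "Select" is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by pd.read_html
# (only three of the five columns are reproduced here).
df = pd.DataFrame({
    "Unnamed: 0": [float("nan"), float("nan")],
    "Tickets  listed": [6655, 929],
    "Price  range": ["$134.50  to  $2,222.50", "$20.39  to  $3,602.50"],
})

# Option 1: drop every column that holds nothing but NaN
# (the radio-button column has no text, so it parses as all NaN).
cleaned = df.dropna(axis="columns", how="all")

# Option 2: keep the column but give it a readable name.
renamed = df.rename(columns={"Unnamed: 0": "Select"})

print(list(cleaned.columns))
print(list(renamed.columns))
```

Dropping is probably the simpler choice here, since the first column only held an input element in the original table.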

Answer 2: (score: 0)

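In my case, the cause was that I was running the program in IDLE rather than PyCharm or some other environment. By default, pandas' print width was not wide enough to fit my data, so the middle columns were collapsed into "...". This question has already been answered here.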
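As a rough illustration of the display-width fix described in this answer, the sketch below shows two common ways to make pandas print every column. The DataFrame is a hypothetical stand-in built from the values in the question, and the width of 1000 is an arbitrary assumption, not a recommended setting.

```python
import pandas as pd

# Hypothetical stand-in for the table parsed in the question.
df = pd.DataFrame({
    "Unnamed: 0": [float("nan"), float("nan")],
    "Event date  Time (local)": ["Thu, 10/11/2018  8:20 p.m.",
                                 "Thu, 10/11/2018  8:21 p.m."],
    "Event name  Venue": ["Philadelphia Eagles at New York Giants",
                          "PARKING PASSES ONLY Philadelphia Eagles at New York Giants"],
    "Tickets  listed": [6655, 929],
    "Price  range": ["$134.50  to  $2,222.50", "$20.39  to  $3,602.50"],
})

# Option A: temporarily lift the column limit and widen the output.
# The settings revert automatically when the with-block exits.
with pd.option_context("display.max_columns", None, "display.width", 1000):
    wide_text = repr(df)
    print(wide_text)

# Option B: to_string() renders all columns regardless of display.max_columns.
full_text = df.to_string()
print(full_text)
```

For a change that lasts for the whole session, pd.set_option("display.max_columns", None) and pd.set_option("display.width", 1000) can be called once at the top of the script instead of using the context manager.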