[miniconda,python 3]
我的数据.xls下载:(密码:堆栈) Download .xls
0) 您可以注意到我的xls文件在第一行中有大的合并单元格,在第2行和第3行中也有一些合并的单元格。这是一个问题吗?如果这是一个问题 - 我可以以某种方式取消它们吗?
1) 我想删除这个xls的第一行,因为对我来说没有重要的信息。我想问题是该行合并了吗?我想使用df = df.drop([0]),但不是删除这个巨大的第一行,而是删除带有列标题的行(以“ID klienta”开头)。那是为什么?
2) 在我摆脱第一行之后,我喜欢处理来自不同列的一些数字(在我的示例中,我想将数据与“Stav”列分开)。我怎么做?我已经看到某个地方可以仅通过其标题名称(字符串)来索引行/列。例如,我想使用标题“Stav”将数据与列分开:Stav = df ['Stav']
到目前为止我的代码是:
import pandas as pd
import numpy as np
print("\n\n*********************************************")
print("My xls processing script\n")
print("*********************************************\n")
#load data
df = pd.read_excel("file.xls")
#My unsucessful attempt to get rid of first row
#uncomment this and it will remove the second row instead of the first row
#df = df.drop([0])
#print preview of 6 rows 5 columnts
print(df.iloc[0:5, 0:4])
print("\n\n")
#My unsuccessful attempt to get column date with header 'ID'
Stav = df['Stav']
print(Stav)
控制台上的输出:
(xls_env) C:\Users\Slavek\Documents\PythonScripts>python xld_proj.py
*********************************************
My xls processing script
*********************************************
Lidé, které jsem podpořil Unnamed: 1 Unnamed: 2 Unnamed: 3
0 ID klienta Název Stav ID příběhu
1 NaN NaN NaN NaN
2 zonky214882 Jeep na cestě 181187
3 zonky235862 Notebook k práci i relaxu na cestě 206317
4 zonky230378 Dětský pokoj v pořádku 199686
Traceback (most recent call last):
File "C:\miniconda\envs\xls_env\lib\site-packages\pandas\core\indexes\base.py", line 2525, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Stav'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "xld_proj.py", line 20, in <module>
Stav = df['Stav']
File "C:\miniconda\envs\xls_env\lib\site-packages\pandas\core\frame.py", line 2139, in __getitem__
return self._getitem_column(key)
File "C:\miniconda\envs\xls_env\lib\site-packages\pandas\core\frame.py", line 2146, in _getitem_column
return self._get_item_cache(key)
File "C:\miniconda\envs\xls_env\lib\site-packages\pandas\core\generic.py", line 1842, in _get_item_cache
values = self._data.get(item)
File "C:\miniconda\envs\xls_env\lib\site-packages\pandas\core\internals.py", line 3843, in get
loc = self.items.get_loc(item)
File "C:\miniconda\envs\xls_env\lib\site-packages\pandas\core\indexes\base.py", line 2527, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Stav'
答案 0 :(得分:1)
我认为您需要读取
中的标题功能选项df = pd.read_excel("file.xls", header =[0,1,2])
然后你可以删除你不想要的标题:
df.columns = df.columns.droplevel([0,1])
或类似的东西。由于变量名称分散在两个子标题中,因此表格有点混乱。我要把它清理干净,以便它们都在同一条线上。
或保留所有标题,并在此处查看: How do I change or access pandas MultiIndex column headers?
答案 1 :(得分:0)
查看输入的Excel文件的屏幕截图以及打印的数据框,您遇到的问题可能是由于第二行和第三行中的合并单元格所致。
我建议使用文档(Link Here)中概述的pandas.DataFrame.to_excel
的一些参数。特别是,header
和skiprows
可以为您提供帮助。
我在下面提供了一个示例,其中我创建了一个excel文件(.xlsx),它复制了合并单元格的问题。然后我将.xlsx复制为.xls,并使用pandas.DataFrame.to_excel
并header
和skiprows
拼写出来阅读。
import pandas as pd
import numpy as np
import shutil
# Creating a dataframe and saving as test.xlsx in current directory
df = pd.DataFrame(np.random.randn(10, 3), columns=list('ABC'))
writer = pd.ExcelWriter('test.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1', startrow=3, index=False,
header=False)
wb = writer.book
ws = writer.sheets['Sheet1']
ws.merge_range('A1:C1', 'Large Merged Cell in first Row')
ws.merge_range('A2:A3', 'A')
ws.merge_range('B2:B3', 'B')
ws.merge_range('C2:C3', 'C')
wb.close()
print(df)
#copying test.xlsx as a .xls file
shutil.copy(r"test.xlsx" , r"test.xls")
new_df = pd.read_excel('test.xls', header = 0, skiprows = [0,2])
print(new_df)
print(new_df)
应该显示:
A B C
0 1.242498 0.512675 -1.370710
1 0.060366 -0.467702 -1.420735
2 -0.198547 0.042364 0.915423
3 0.340909 0.749019 0.272871
4 2.633348 -1.343251 -0.248733
5 0.892257 0.371924 0.023415
6 -0.809030 -0.633796 0.449373
7 0.322960 2.073352 1.362657
8 -0.848093 1.848489 0.813144
9 2.718069 -0.540174 1.411980