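I am using a pandas DataFrame in Python 3.6 to index files and their properties. My initial solution stored the file names in the first column of the DataFrame and the numeric properties in the other columns.
When I loop over the files to collect the properties and try to assign the values to the corresponding columns of the DataFrame, the values are not stored correctly.
After several attempts I got the code to work, but I do not understand why the initial solution fails.
Can anyone explain this, or suggest a better way to assign values to elements of a DataFrame that does not trigger the warning? (I know how to turn the warning off in this case, but I would rather not.)
The problem is illustrated in the code below. It also occurs if the DataFrame is created in a different way, or if the string column sits in a different position, e.g. as the second or third column.
I have not tried other data types such as bool, but I suspect the problem affects DataFrames with mixed data types in general.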
#!/usr/bin/python3

# Import standard libraries
import pandas as pd
import numpy as np

# constants used as label for harmonization with the HDF5 ontology used
ROW_LENGTH = 11
COL1 = 'x1'
COL2 = 'x2'
COL3 = 'x3'


def _main():
    # Create a dataframe
    first_df = pd.DataFrame(columns=[COL1, COL2, COL3])
    first_df[COL1] = ["foo"]*ROW_LENGTH
    first_df[COL2] = [np.NaN]*ROW_LENGTH
    first_df[COL3] = [np.NaN]*ROW_LENGTH

    # Go around assigning data
    for row in range(ROW_LENGTH):
        first_df[COL1][row] = "{}".format(row)
        first_df[COL2][row] = row*2 # Although it gives warning, it works
        first_df.loc[row][COL3] = row*3 # And this, that should work, don't

    print("Although no data was not stored on the third column using: first_df.loc[row][COL3]")
    print(first_df.head())
    print("\n...I can retrieve the data like: first_df[COL2][5] = '{}'".format(first_df[COL2][3]))
    print("... or like that: first_df.loc[5][COL2] = '{}'".format(first_df.loc[3][COL2]))

    # If the first row is numeric...
    second_df = pd.DataFrame(columns=[COL1, COL2, COL3])
    second_df[COL1] = [0.0]*ROW_LENGTH
    second_df[COL2] = [0.0]*ROW_LENGTH
    second_df[COL3] = [0.0]*ROW_LENGTH

    # Go around assigning data
    for row in range(ROW_LENGTH):
        second_df[COL1][row] = row*1.0
        second_df[COL2][row] = row*2.0
        second_df.loc[row][COL3] = row*3.0

    print("\nNow if I use only numeric columns, everything works as expected:")
    print(second_df.head())


if __name__ == '__main__':
    _main()
The output is:
Although no data was not stored on the third column using: first_df.loc[row][COL3]
  x1   x2   x3
0  0  0.0  NaN
1  1  2.0  NaN
2  2  4.0  NaN
3  3  6.0  NaN
4  4  8.0  NaN

...I can retrieve the data like: first_df[COL2][5] = '6.0'
... or like that: first_df.loc[5][COL2] = '6.0'

Now if I use only numeric columns, everything works as expected:
    x1   x2    x3
0  0.0  0.0   0.0
1  1.0  2.0   3.0
2  2.0  4.0   6.0
3  3.0  6.0   9.0
4  4.0  8.0  12.0
The warning message looks like this:
./test.py:24: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
first_df[COL2][row] = row*2 # Although it gives warning, it works
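It can be turned off with pd.options.mode.chained_assignment = None
I think the code is self-explanatory about the expected result, but in short: I want to be able to access any element using the .loc method.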
Answer 0 (score: 1)
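Use first_df.loc[row, COL3] instead of first_df.loc[row][COL3].
When you write first_df.loc[row][COL3], first_df.loc[row] first creates a temporary Series; the value at COL3 is then accessed and modified on that Series, and the temporary Series is discarded. It is equivalent to: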
tmp = first_df.loc[row]
tmp[COL3] = row*3
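tmp is never written back to the original DataFrame. When the columns have mixed dtypes, first_df.loc[row] has to build a new Series to hold the row, so tmp is a copy. When every column shares one dtype, pandas may return a view on the original data instead, which is likely why the all-numeric case appeared to work.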
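The single-call form can be sketched as follows. This is a minimal illustration rather than the asker's actual script: the frame name df is an assumption, and the frame is built from a dict to keep the example short. It also shows .at, the scalar-only counterpart of .loc in pandas, as one possible alternative for writing individual cells.

```python
import numpy as np
import pandas as pd

ROW_LENGTH = 11
COL1, COL2, COL3 = "x1", "x2", "x3"

# Mixed-dtype frame shaped like first_df in the question:
# one string column and two float columns (df is an illustrative name).
df = pd.DataFrame({
    COL1: ["foo"] * ROW_LENGTH,
    COL2: [np.nan] * ROW_LENGTH,
    COL3: [np.nan] * ROW_LENGTH,
})

for row in range(ROW_LENGTH):
    # One indexing call with (row label, column label) operates on df itself,
    # so no temporary Series is created and then thrown away.
    df.loc[row, COL1] = str(row)
    df.loc[row, COL2] = row * 2
    # .at accepts exactly one row label and one column label.
    df.at[row, COL3] = row * 3

print(df.head())
```

Because each assignment is a single indexing operation on df, there is no intermediate object whose changes could be lost, and chained indexing is avoided altogether.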