Question

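I have a pandas DataFrame that is grouped by specific columns. Now I want to insert the mean of the values in four adjacent columns into a new column. This is what I did: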

import re
import pandas as pd

df = pd.read_csv(filename)
# in this line I extract a unique ID from the filename
id = re.search(r'(\w\w\w)', filename).group(1)

The file looks like this:

col1   | col2  | col3
-----------------------
str1a  | str1b | float1

My idea is now the following:

# get the numeric values
df2 = pd.DataFrame(df.groupby(['col1', 'col2']).mean()['col3']).T
# insert the id into a new column
df2.insert(0, 'ID', id)

Now I loop over all:

for j in range(len(df2.values)):
    for k in df['col1'].unique():
        df2.insert(j+5, (k, 'mean'), df2.values[j])

df2.to_excel('text.xlsx')

But I get the following error, which points to the df.insert line:

TypeError: not all arguments converted during string formatting

and

if not allow_duplicates and item in self.items:
    # Should this be a different kind of error??
    raise ValueError('cannot insert %s, already exists' % item)

I'm not sure what string formatting is happening here, since I'm only passing numeric values.
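A possible reading of the traceback, shown as a minimal sketch in plain Python: the string formatting happens inside pandas' own error message, not in the values being passed. The column label `(k, 'mean')` is a tuple, and `'%s' % some_tuple` treats the tuple as a list of format arguments, so the intended "already exists" ValueError appears to be masked by a TypeError. The label `('a', 'mean')` below is a made-up stand-in for `(k, 'mean')`.

```python
# Hypothetical stand-in for the tuple column label (k, 'mean')
item = ('a', 'mean')

try:
    # Same formatting expression as in the traceback: a 2-tuple supplies
    # two arguments to a format string with only one %s placeholder.
    message = 'cannot insert %s, already exists' % item
except TypeError as exc:
    error_text = str(exc)
else:
    error_text = None

# Wrapping the label in a 1-tuple formats it as a single argument.
safe_message = 'cannot insert %s, already exists' % (item,)

print(error_text)
print(safe_message)
```

If this reading is right, the underlying problem would be that a column with the same label is being inserted twice, which the loop can cause when the same `(k, 'mean')` label is reused on a later iteration.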

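The final output should contain all the values from col3 (indexed by id), and every fifth column should be the inserted mean of the preceding four values.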
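One way to build that layout without calling `insert` repeatedly is sketched below. It is only an illustration under assumptions: the sample data, the name `file_id`, and the idea that each `col1` group holds four `col2` entries are all invented here, not taken from the real file.

```python
import pandas as pd

# Made-up sample data: two col1 groups with four col2 entries each
df = pd.DataFrame({
    'col1': ['a'] * 4 + ['b'] * 4,
    'col2': ['w', 'x', 'y', 'z'] * 2,
    'col3': [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
})
file_id = 'abc'  # hypothetical ID extracted from the filename

# Mean of col3 per (col1, col2) pair, as a Series with a MultiIndex
means = df.groupby(['col1', 'col2'])['col3'].mean()

pieces = []
for k, grp in means.groupby(level='col1'):
    pieces.append(grp)  # the values of this col1 group
    # followed by one extra entry labelled (k, 'mean')
    label = pd.MultiIndex.from_tuples([(k, 'mean')], names=['col1', 'col2'])
    pieces.append(pd.Series([grp.mean()], index=label))

# One row per file, indexed by the ID; columns are (col1, col2) tuples
df2 = pd.concat(pieces).to_frame(name=file_id).T
df2.index.name = 'ID'
print(df2)
```

Because every `(k, 'mean')` label is created exactly once, no duplicate column is ever inserted. The resulting frame could then be written out with `df2.to_excel(...)` as in the original code.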

Answer 1

If I had to work with a file like yours, I would write a function to convert it to CSV... something like this:

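import pandas

data = []
with open(filename) as file:  # filename: path to the pipe-delimited file
    for lineInFile in file.read().splitlines():
        lineInFile_splited = [cell.strip() for cell in lineInFile.split('|')]
        if len(lineInFile_splited) > 1:  # keep only data rows, skip the '-------' line
            data.append(lineInFile_splited)
# the first collected row is the header (col1 | col2 | col3)
df = pandas.DataFrame(data[1:], columns=data[0])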

Hope it helps!
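As an alternative to splitting lines by hand, `pandas.read_csv` can probably parse this layout directly. The sketch below assumes the file really looks like the sample in the question (a header line, one dashed separator line, then `|`-delimited rows); the in-memory text and the value `1.5` are made up for the demonstration.

```python
import io
import pandas as pd

# Made-up file contents mirroring the layout shown in the question
raw = (
    "col1   | col2  | col3\n"
    "-----------------------\n"
    "str1a  | str1b | 1.5\n"
)

df = pd.read_csv(
    io.StringIO(raw),      # replace with the real file path
    sep=r'\s*\|\s*',       # a pipe with optional surrounding whitespace
    engine='python',       # regex separators need the python engine
    skiprows=[1],          # skip the dashed line below the header
)
print(df)
```

With this approach col3 is parsed as a numeric column, so it can go straight into `groupby(...).mean()`.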

Insert data into a grouped DataFrame (pandas)
