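How do I append a pandas DataFrame to a CSV and create new columns when necessary?

Asked: 2019-10-15 20:14:09

Tags: python pandas dataframe

I have an output called df, a pandas DataFrame created inside a loop over many films like this one, so each df holds the data for a single film: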

import pandas as pd

# one-hot encode the four genre slots, then strip the 'genreN_' prefixes
df = pd.get_dummies(data=df, columns=['genre1', 'genre2', 'genre3', 'genre4'])
df = df.rename(columns=lambda x: x.replace('genre1_', ''))
df = df.rename(columns=lambda x: x.replace('genre2_', ''))
df = df.rename(columns=lambda x: x.replace('genre3_', ''))
df = df.rename(columns=lambda x: x.replace('genre4_', ''))
# merge columns that now share a name (same genre in more than one slot)
df = pd.concat([df[col].sum(axis=1).rename(col) if len(df[col].shape)==2 else
                df[col] for col in df.columns.unique()],axis=1)
print(df)
# append to the CSV, writing the header only while the file is still empty
with open('test.csv', 'a') as f:
    df.to_csv(f, mode='a', header=f.tell()==0)

The problem is that each iteration of the loop has a different set of genres.

So for the first iteration, the output looks like this:

title     runTime    comedy    action    drama   biography  ......
film1      90mins      1         1         1         1

This is then appended to the csv.

But in the next iteration of the loop, the next film looks like this:

title     runTime    comedy    action    history     ......
film2      90mins      1         1         1

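What I want now is for a new column called history to be created, with a 1 in the row for film2 and a 0 for film1, and for film2 to get a 0 in the biography and drama columns.

Currently, it simply treats the first film's columns as the default and assumes every other film has the same genres.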

So the first iteration produces a df that looks like this: [image]

The second iteration looks like this: [image]

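1 Answer:

Answer 0: (score: 0)

Appending to the CSV file as you go makes this impossible, because the header is frozen after the first write, before you know all the possible columns.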
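To make the frozen-header problem concrete, here is a minimal sketch. The two one-row frames are made up for illustration, and an in-memory buffer stands in for test.csv; `index=False` is used only to keep the output short.

```python
import io

import pandas as pd

# Hypothetical one-film frames whose genre columns differ.
film1 = pd.DataFrame({"title": ["film1"], "comedy": [1], "drama": [1]})
film2 = pd.DataFrame({"title": ["film2"], "comedy": [1], "history": [1]})

buf = io.StringIO()  # stands in for test.csv
for i, film in enumerate([film1, film2]):
    # As in the question, the header is written only for the first frame.
    film.to_csv(buf, index=False, header=(i == 0))

# Reading the file back shows film2's "history" flag filed under "drama",
# because values are matched to the header by position, not by name.
result = pd.read_csv(io.StringIO(buf.getvalue()))
print(result)
```

The file never gets a history column, and nothing in the CSV records that the third value of the film2 row was meant to be a different genre.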

A simpler approach is to build the full dataframe first and only then do the same thing you are already doing; see the following sample code:

import pandas as pd

# initialize the full dataframe
df_full = pd.DataFrame()
# ... loop reading the df of one film's data
# once the raw df is created, you would do:
df_full = pd.concat([df_full, df])
# when the loop has finished, run your code on df_full, like:
df_full = pd.get_dummies(data=df_full, columns=['genre1', 'genre2', 'genre3', 'genre4'])
# continue with the rest of your code, eventually writing the csv file
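As a runnable illustration of this in-memory approach, here is a small sketch. The two sample films, the `raw_frames` list, and the temp-file path are assumptions made for the example; the real loop would supply its own raw dataframes and write to test.csv.

```python
import os
import tempfile

import pandas as pd

genre_cols = ["genre1", "genre2", "genre3", "genre4"]

# Hypothetical raw one-film frames, before any dummy encoding.
raw_frames = [
    pd.DataFrame([{"title": "film1", "runTime": "90mins", "genre1": "comedy",
                   "genre2": "action", "genre3": "drama", "genre4": "biography"}]),
    pd.DataFrame([{"title": "film2", "runTime": "90mins", "genre1": "comedy",
                   "genre2": "action", "genre3": "history", "genre4": None}]),
]

# Concatenate once, then encode, so every genre gets a column for every film.
df_full = pd.concat(raw_frames, ignore_index=True)
dummies = pd.get_dummies(df_full[genre_cols], columns=genre_cols,
                         prefix="", prefix_sep="", dtype=int)
# The same genre may sit in different slots for different films;
# summing columns that share a name merges them into one column.
dummies = dummies.T.groupby(level=0).sum().T
result = pd.concat([df_full.drop(columns=genre_cols), dummies], axis=1)

# Write the CSV once, with the complete header (temp path used for the demo).
csv_path = os.path.join(tempfile.mkdtemp(), "test.csv")
result.to_csv(csv_path, index=False)
print(result)
```

Because the encoding happens after the concat, film1 gets a 0 under history and film2 gets 0 under biography and drama without any extra bookkeeping.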

Following the OP's comment, I have written code that handles the case where this cannot all be done in memory:

import pandas as pd

# pass 1: initialize the full set of genres and the fixed genre columns
full_genre_cols = set()
genre_cols = ['genre1', 'genre2', 'genre3', 'genre4']
# ... for each raw dataframe: remember its non-genre columns, then collect new genres
non_genre_cols = df.drop(genre_cols, axis=1).columns
df = pd.get_dummies(df, columns=genre_cols, prefix_sep='', prefix='')
actual_genre_cols = df.drop(non_genre_cols, axis=1).columns
full_genre_cols.update(actual_genre_cols)
# ... finish reading all dataframes, and start over

# pass 2: this time create full-column dataframes and append them to the CSV file
# ... for each raw dataframe: prepare and transform it as before
non_genre_cols = df.drop(genre_cols, axis=1).columns
df = pd.get_dummies(df, columns=genre_cols, prefix_sep='', prefix='')
# full column list: non-genre columns first, then every genre in a fixed (sorted) order
full_cols = list(non_genre_cols) + sorted(full_genre_cols)
# reindex keeps the current values and adds a 0 column for every genre
# that does not exist in this cycle (assumes no genre is listed twice for one film)
df_fullcols = df.reindex(columns=full_cols, fill_value=0)
# all that is left is to append to the file
with open('test.csv', 'a') as f:
    df_fullcols.to_csv(f, mode='a', header=f.tell()==0)
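
For reference, the two-pass idea can also be sketched as a self-contained script. The `load_films` generator, the sample films, the hard-coded `non_genre_cols` list, and the temp-file path are all assumptions made for illustration and are not part of the OP's code. The sketch uses `DataFrame.reindex` to add the missing genre columns, and it assumes a film never lists the same genre twice.

```python
import os
import tempfile

import pandas as pd

GENRE_COLS = ["genre1", "genre2", "genre3", "genre4"]


def load_films():
    """Hypothetical stand-in for the loop that yields one raw df per film."""
    yield pd.DataFrame([{"title": "film1", "runTime": "90mins", "genre1": "comedy",
                         "genre2": "action", "genre3": "drama", "genre4": "biography"}])
    yield pd.DataFrame([{"title": "film2", "runTime": "90mins", "genre1": "comedy",
                         "genre2": "action", "genre3": "history", "genre4": None}])


def encode(df):
    """One-hot encode the genre slots of a raw one-film df."""
    return pd.get_dummies(df, columns=GENRE_COLS, prefix="", prefix_sep="", dtype=int)


# Pass 1: collect only the genre names; no film data is kept in memory.
non_genre_cols = ["title", "runTime"]  # assumed fixed for every film
all_genres = set()
for raw in load_films():
    all_genres.update(encode(raw).columns.difference(non_genre_cols))
full_cols = non_genre_cols + sorted(all_genres)

# Pass 2: give each film the full column set and append it to the CSV.
csv_path = os.path.join(tempfile.mkdtemp(), "test.csv")
for raw in load_films():
    row = encode(raw).reindex(columns=full_cols, fill_value=0)
    write_header = not os.path.exists(csv_path)  # header only on first write
    row.to_csv(csv_path, mode="a", index=False, header=write_header)

result = pd.read_csv(csv_path)
print(result)
```

The cost of this design is reading every film twice, in exchange for holding only one film and the set of genre names in memory at any time.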