我已经尝试了一段时间而且我被卡住了。这就是问题所在:
我正在处理有关CSV文件中的文本的一些元数据。它看起来像这样:
真实的表更长更复杂,但它遵循相同的逻辑:每一行都是文本,每一列都是文本的不同方面。我在一些列中有很多变化,我希望它在一个更简单的改造。例如,从叙事视角转变同性存在和自我存在的价值观到非异质性的价值观。我在另一个名为关键字的CSV文件中定义了这个新模型,如下所示:
如您所见,元数据的每一列都成为新模型关键字中的一行,其中旧值位于term_value列中,新值位于new_model列中。
所以我需要使用Pandas映射或替换这些值。这就是我现在所拥有的:
import re
import pandas as pd
df_metadata = pd.read_csv("/metadata.csv", encoding="utf-8", sep=",")
df_keywords = pd.read_csv("/keywords.csv", encoding="utf-8", sep="\t")
for column_metadata,value_metadata in df_metadata.iteritems():
if str(column_metadata) in list(df_keywords.loc[:,"term_type"]):
df_metadata.loc[df_metadata[column_metadata] == value_metadata, column_metadata] = df_keywords.loc[df_keywords["term_value"] == value_metadata, ["new_model"]]
Python总是会回复此错误:
“ValueError:系列长度必须匹配才能比较”
我认为问题出现在使用loc替换的第二部分的value_metadata中,我的意思是:
df_keywords.loc[df_keywords["term_value"] == value_metadata, ["new_model"]]
我不明白的是为什么value_metadata在这个命令的第一部分工作,但它不在第二部分......
拜托,我将不胜感激任何帮助。也许有一种比迭代数据帧更简单的方法...我对任何建议都很开放。最好的祝福, 何
答案 0 :(得分:1)
您可以先在Multiindex
中创建df_keywords
,以便通过slicers更快地选择,并在旧版本中以map
循环新值:
df_keywords.set_index(['term_type','term_value'], inplace=True)
idx = pd.IndexSlice
#first maping in column narrative-perspective
print (df_keywords.loc[idx['narrative-perspective',:]].to_dict()['new_model'])
{'heterodiegetic': 'heterodiegetic', 'other/mixed': 'n-heterodiegetic',
'homodiegetic': 'n-heterodiegetic', 'autodiegetic': 'n-heterodiegetic'}
#column names for replacing
L = ['narrative-perspective','narrator','protagonist-gender']
for col in L:
df_metadata[col] =
df_metadata[col].map(df_keywords.loc[idx[col,:]].to_dict()['new_model'])
print (df_metadata)
idno author-name narrative-perspective narrator protagonist-gender
0 ne0001 Baroja n-heterodiegetic third-person male
1 ne0002 Galdos heterodiegetic third-person n-male
2 ne0003 Galdos n-heterodiegetic third-person male
3 ne0004 Galdos n-heterodiegetic third-person n-male
4 ne0005 Galdos heterodiegetic third-person n-male
5 ne0006 Galdos heterodiegetic third-person male
6 ne0007 Sawa heterodiegetic third-person n-male
7 ne0008 Zamacois heterodiegetic third-person n-male
8 ne0009 Galdos heterodiegetic third-person n-male
9 ne0011 Galdos n-heterodiegetic n-third-person male
to_dict
也可省略,然后按Series
:
df_keywords.set_index(['term_type','term_value'], inplace=True)
idx = pd.IndexSlice
#first maping in column narrative-perspective
print (df_keywords.loc[idx['narrative-perspective',:]]['new_model'])
term_value
autodiegetic n-heterodiegetic
heterodiegetic heterodiegetic
homodiegetic n-heterodiegetic
other/mixed n-heterodiegetic
Name: new_model, dtype: object
L = ['narrative-perspective','narrator','protagonist-gender']
for col in L:
df_metadata[col] = df_metadata[col].map(df_keywords.loc[idx[col,:]]['new_model'])
print (df_metadata)
idno author-name narrative-perspective narrator protagonist-gender
0 ne0001 Baroja n-heterodiegetic third-person male
1 ne0002 Galdos heterodiegetic third-person n-male
2 ne0003 Galdos n-heterodiegetic third-person male
3 ne0004 Galdos n-heterodiegetic third-person n-male
4 ne0005 Galdos heterodiegetic third-person n-male
5 ne0006 Galdos heterodiegetic third-person male
6 ne0007 Sawa heterodiegetic third-person n-male
7 ne0008 Zamacois heterodiegetic third-person n-male
8 ne0009 Galdos heterodiegetic third-person n-male
9 ne0011 Galdos n-heterodiegetic n-third-person male