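I am analysing a survey that participants completed through SurveyMonkey. Unfortunately, the data is not organised ideally: every categorical response option for every question has its own column.
For example, here are the first few rows of the responses in the DataFrame: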
How long have you been participating in the Garden Awards Program? \
0 One year
1 NaN
2 NaN
3 NaN
4 NaN
Unnamed: 10 Unnamed: 11 Unnamed: 12 \
0 2-3 years 4-5 years 5 or more years
1 NaN NaN NaN
2 NaN 4-5 years NaN
3 2-3 years NaN NaN
4 NaN NaN 5 or more years
How did you initially learn of the Garden Awards Program? \
0 I nominated my garden to be evaluated
1 NaN
2 I nominated my garden to be evaluated
3 NaN
4 NaN
Unnamed: 14 etc...
0 A friend or family member nominated my garden ...
1 A friend or family member nominated my garden ...
2 NaN
3 NaN
4 NaN
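The question "How long have you been participating in the Garden Awards Program?" has the valid responses One year, 2-3 years, etc., and all of them appear in the first row, which acts as a key for which column holds which value. This is the first problem. (The same applies to "How did you initially learn of the Garden Awards Program?", whose valid responses are I nominated my garden to be evaluated, A friend or family member nominated my garden, etc.)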
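The second problem is that the additional columns for each categorical response are named Unnamed: N, where N counts across the category columns of all questions.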
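Before I start remapping and flattening/collapsing the columns into one column per question, I was wondering whether there is any other way to handle survey data presented like this with Pandas. All my searches pointed to the SurveyMonkey API, but I don't see how that would be useful.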
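I am guessing I will need to flatten the columns, so if anyone could recommend a method, that would be great. I think there is a way to grab all the columns belonging to a categorical response by taking adjacent columns until Unnamed no longer appears in the column name, but I can't work out how to do this.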
Answer 0 (score: 2)
I will use the following DataFrame (it can be downloaded as CSV from here):
Q1 Unnamed: 2 Unnamed: 3 Q2 Unnamed: 5 Unnamed: 6 Q3 Unnamed: 7 Unnamed: 8
0 A1-A A1-B A1-C A2-A A2-B A2-C A3-A A4-B A3-C
1 A1-A NaN NaN NaN A2-B NaN NaN NaN A3-C
2 NaN A1-B NaN A2-A NaN NaN NaN A4-B NaN
3 NaN NaN A1-C NaN A2-B NaN A3-A NaN NaN
4 NaN A1-B NaN NaN NaN A2-C NaN NaN A3-C
5 A1-A NaN NaN NaN A2-B NaN A3-A NaN NaN
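If downloading the CSV is not convenient, the same frame can be rebuilt inline. This is only a sketch: it copies the column names and values exactly as printed in the table above, and the name nan is just local shorthand for np.nan.

```python
import numpy as np
import pandas as pd

nan = np.nan  # shorthand, not part of the original answer

# Column names copied verbatim from the printed table.
columns = ["Q1", "Unnamed: 2", "Unnamed: 3",
           "Q2", "Unnamed: 5", "Unnamed: 6",
           "Q3", "Unnamed: 7", "Unnamed: 8"]

rows = [
    # Row 0 is the key row: it lists every option of every question.
    ["A1-A", "A1-B", "A1-C", "A2-A", "A2-B", "A2-C", "A3-A", "A4-B", "A3-C"],
    ["A1-A", nan, nan, nan, "A2-B", nan, nan, nan, "A3-C"],
    [nan, "A1-B", nan, "A2-A", nan, nan, nan, "A4-B", nan],
    [nan, nan, "A1-C", nan, "A2-B", nan, "A3-A", nan, nan],
    [nan, "A1-B", nan, nan, nan, "A2-C", nan, nan, "A3-C"],
    ["A1-A", nan, nan, nan, "A2-B", nan, "A3-A", nan, nan],
]

df = pd.DataFrame(rows, columns=columns)
print(df)
```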
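Key assumptions:
1. Every column whose name does not start with Unnamed is actually the title of a question.
2. The columns between two question titles hold the response options of the question on their left.
Solution overview:
1. Find the column index at which each question starts and ends.
2. Flatten each question into a single column (pd.Series).
3. Concatenate the question columns back together.
Implementation (part 1):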
indices = [i for i, c in enumerate(df.columns) if not c.startswith('Unnamed')]
questions = [c for c in df.columns if not c.startswith('Unnamed')]
slices = [slice(i, j) for i, j in zip(indices, indices[1:] + [None])]
Iterating over these slices, as below, gives you one DataFrame per question:
for q in slices:
    print(df.iloc[:, q])  # Use `display` if using Jupyter
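To make the slicing step concrete, here is a small sketch that runs the same logic on the column names alone, so it needs no data. The list columns is an illustrative stand-in for df.columns, with the names taken from the sample table.

```python
# Stand-in for df.columns (names copied from the sample table).
columns = ["Q1", "Unnamed: 2", "Unnamed: 3",
           "Q2", "Unnamed: 5", "Unnamed: 6",
           "Q3", "Unnamed: 7", "Unnamed: 8"]

# Positions where a new question starts.
indices = [i for i, c in enumerate(columns) if not c.startswith("Unnamed")]
questions = [columns[i] for i in indices]

# Pair each start with the next start; the last question runs to the end (None).
slices = [slice(i, j) for i, j in zip(indices, indices[1:] + [None])]

print(indices)  # [0, 3, 6]
for name, s in zip(questions, slices):
    print(name, "->", columns[s])
```

Appending None to the shifted list is what lets the final question pick up every remaining column without knowing the total column count.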
Implementation (parts 2-3):
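import numpy as np
import pandas as pd

def parse_response(s):
    # Return the first non-null value in the row, or NaN if nothing was answered.
    try:
        return s.dropna().iloc[0]
    except IndexError:
        return np.nan

# Flatten each question to a single Series; [1:] drops the key row (row 0).
data = [df.iloc[:, q].apply(parse_response, axis=1)[1:] for q in slices]
df = pd.concat(data, axis=1)
df.columns = questions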
Output:
Q1 Q2 Q3
1 A1-A A2-B A3-C
2 A1-B A2-A A4-B
3 A1-C A2-B A3-A
4 A1-B A2-C A3-C
5 A1-A A2-B A3-A
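For reuse, the steps above could be wrapped in a single function. The following is a minimal sketch under the same assumptions (row 0 is the key row, and option columns are the ones whose names start with Unnamed); the names flatten_survey and first_answer, and the small demo frame, are made up for illustration.

```python
import numpy as np
import pandas as pd


def flatten_survey(raw):
    """Collapse one-column-per-option data to one column per question."""
    # Positions of the question-title columns.
    starts = [i for i, c in enumerate(raw.columns)
              if not str(c).startswith("Unnamed")]
    body = raw.iloc[1:]  # drop the key row

    def first_answer(row):
        # First non-null option in the row, or NaN if unanswered.
        answered = row.dropna()
        return answered.iloc[0] if len(answered) else np.nan

    flat = {raw.columns[i]: body.iloc[:, i:j].apply(first_answer, axis=1)
            for i, j in zip(starts, starts[1:] + [None])}
    return pd.DataFrame(flat)


# Hypothetical two-question frame; respondent 2 skipped Q2.
raw = pd.DataFrame(
    [["A1-A", "A1-B", "A2-A", "A2-B"],
     ["A1-A", np.nan, np.nan, "A2-B"],
     [np.nan, "A1-B", np.nan, np.nan]],
    columns=["Q1", "Unnamed: 1", "Q2", "Unnamed: 3"],
)
flat = flatten_survey(raw)
print(flat)
```

Note that this keeps only the first non-null option per question, so it fits single-choice questions; a multiple-choice question would need a different reduction.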