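Formatting SurveyMonkey data with Pandas

Time: 2018-03-30 20:16:47

Tags: pandas flatten surveymonkey

I am analyzing a survey that participants completed through SurveyMonkey. Unfortunately, the data is not organized in an ideal way, because every categorical response to every question has its own column.

For example, here are the first few rows for one of the responses in the dataframe: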

     How long have you been participating in the Garden Awards Program?  \
0                                           One year                   
1                                                NaN                   
2                                                NaN                   
3                                                NaN                   
4                                                NaN                   

  Unnamed: 10 Unnamed: 11      Unnamed: 12  \
0   2-3 years   4-5 years  5 or more years   
1         NaN         NaN              NaN   
2         NaN   4-5 years              NaN   
3   2-3 years         NaN              NaN   
4         NaN         NaN  5 or more years   

  How did you initially learn of the Garden Awards Program?  \
0              I nominated my garden to be evaluated          
1                                                NaN          
2              I nominated my garden to be evaluated          
3                                                NaN          
4                                                NaN          

                                         Unnamed: 14  etc...
0  A friend or family member nominated my garden ...  
1  A friend or family member nominated my garden ...  
2                                                NaN  
3                                                NaN  
4                                                NaN  

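The question How long have you been participating in the Garden Awards Program? has the valid responses One year, 2-3 years, etc., and these all appear in the first row, which acts as a key to which column holds which value. That is the first problem. (The same applies to How did you initially learn of the Garden Awards Program?, whose valid responses are I nominated my garden to be evaluated, A friend or family member nominated my garden, etc.)

The second problem is that the additional columns for each categorical response are named Unnamed: N, where N is the column's number, counted across the categories of all questions.

Before I start remapping and flattening/collapsing the columns into a single column per question, I would like to know whether there is any other way to handle survey data presented like this with Pandas. All my searches pointed to the SurveyMonkey API, but I don't see how it would be useful.

I'm guessing I will need to flatten the columns, so if anyone can recommend a method, that would be great. I think there is a way to grab all the columns that belong to one categorical response by taking adjacent columns until Unnamed no longer appears in the column name, but I can't work out how to do it.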

1 answer:

Answer 0: (score: 2)

I will use the following DataFrame (it can be downloaded as a CSV from here):

     Q1 Unnamed: 2 Unnamed: 3    Q2 Unnamed: 5 Unnamed: 6    Q3 Unnamed: 7 Unnamed: 8
0  A1-A       A1-B       A1-C  A2-A       A2-B       A2-C  A3-A       A4-B       A3-C
1  A1-A        NaN        NaN   NaN       A2-B        NaN   NaN        NaN       A3-C
2   NaN       A1-B        NaN  A2-A        NaN        NaN   NaN       A4-B        NaN
3   NaN        NaN       A1-C   NaN       A2-B        NaN  A3-A        NaN        NaN
4   NaN       A1-B        NaN   NaN        NaN       A2-C   NaN        NaN       A3-C
5  A1-A        NaN        NaN   NaN       A2-B        NaN  A3-A        NaN        NaN
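
In case the CSV is not at hand, the sample frame can also be rebuilt by hand. The following is a minimal sketch that copies the values from the table above; it assumes the option labels sit in row 0 and that missing answers are NaN, as shown.

```python
import numpy as np
import pandas as pd

# Column labels exactly as they appear in the sample table.
columns = ["Q1", "Unnamed: 2", "Unnamed: 3",
           "Q2", "Unnamed: 5", "Unnamed: 6",
           "Q3", "Unnamed: 7", "Unnamed: 8"]
nan = np.nan

# Row 0 holds the option labels; rows 1-5 hold one respondent each.
rows = [
    ["A1-A", "A1-B", "A1-C", "A2-A", "A2-B", "A2-C", "A3-A", "A4-B", "A3-C"],
    ["A1-A", nan, nan, nan, "A2-B", nan, nan, nan, "A3-C"],
    [nan, "A1-B", nan, "A2-A", nan, nan, nan, "A4-B", nan],
    [nan, nan, "A1-C", nan, "A2-B", nan, "A3-A", nan, nan],
    [nan, "A1-B", nan, nan, nan, "A2-C", nan, nan, "A3-C"],
    ["A1-A", nan, nan, nan, "A2-B", nan, "A3-A", nan, nan],
]
df = pd.DataFrame(rows, columns=columns)
print(df)
```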

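Main assumptions:

  1. Every column whose name does not start with Unnamed is actually a question title
  2. The columns between two question titles hold the options of the question at the left end of that column interval

Solution outline:

  1. Find the indices where each question starts and ends
  2. Flatten each question into a single column (pd.Series)
  3. Concatenate the question columns back together

Implementation (part 1):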

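      # Positions of the columns that start a question (name does not start with 'Unnamed')
      indices = [i for i, c in enumerate(df.columns) if not c.startswith('Unnamed')]
      # The question titles themselves, in the same order
      questions = [c for c in df.columns if not c.startswith('Unnamed')]
      # One slice per question: from its start up to the next question (None = to the last column)
      slices = [slice(i, j) for i, j in zip(indices, indices[1:] + [None])]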
      

You can see that iterating over the slices, as below, gives you one DataFrame per question:

      for q in slices:
          print(df.iloc[:, q])  # Use `display` if using Jupyter
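
To make the slicing step concrete, here is a small sketch that runs the same three list comprehensions on the sample column labels alone (no DataFrame is needed for this part). It assumes the labels are exactly those of the sample frame above.

```python
# Column labels copied from the sample DataFrame; only the labels matter here.
columns = ["Q1", "Unnamed: 2", "Unnamed: 3",
           "Q2", "Unnamed: 5", "Unnamed: 6",
           "Q3", "Unnamed: 7", "Unnamed: 8"]

indices = [i for i, c in enumerate(columns) if not c.startswith("Unnamed")]
questions = [c for c in columns if not c.startswith("Unnamed")]
slices = [slice(i, j) for i, j in zip(indices, indices[1:] + [None])]

print(indices)    # [0, 3, 6]
print(questions)  # ['Q1', 'Q2', 'Q3']
print(slices)     # [slice(0, 3, None), slice(3, 6, None), slice(6, None, None)]

# Each slice selects a question column plus the 'Unnamed' columns to its right.
for name, s in zip(questions, slices):
    print(name, columns[s])
```

The trailing None in the last slice is what lets the final question run to the last column without knowing the column count in advance.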
      

Implementation (parts 2-3):

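      import numpy as np
      import pandas as pd

      def parse_response(s):
          # Return the first non-null answer in the row, or NaN if the question was skipped
          answered = s.dropna()
          return answered.iloc[0] if not answered.empty else np.nan

      # Flatten each question block row by row, then drop row 0 (the option labels)
      data = [df.iloc[:, q].apply(parse_response, axis=1).iloc[1:] for q in slices]
      df = pd.concat(data, axis=1)
      df.columns = questions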
      

Output:

           Q1    Q2    Q3
      1  A1-A  A2-B  A3-C
      2  A1-B  A2-A  A4-B
      3  A1-C  A2-B  A3-A
      4  A1-B  A2-C  A3-C
      5  A1-A  A2-B  A3-A
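
As a possible alternative to the row-wise apply, the same collapse can be sketched by labelling each column with the question it belongs to and back-filling along each row. This is not part of the answer above; `collapse_questions` is a hypothetical helper name, and the sketch relies on the same assumptions (question titles do not start with Unnamed, and row 0 holds the option labels). The toy frame is a trimmed copy of the sample data.

```python
import numpy as np
import pandas as pd

def collapse_questions(raw):
    """Collapse one-column-per-option survey data to one column per question."""
    cols = pd.Series(raw.columns)
    # 'Unnamed' columns inherit the nearest question title to their left.
    owner = cols.where(~cols.str.startswith("Unnamed")).ffill()
    answers = raw.iloc[1:]  # drop the option-label row
    out = {}
    for question in owner.dropna().unique():
        block = answers.loc[:, (owner == question).to_numpy()]
        # Back-filling along the row pulls the first non-null answer
        # into the block's first column.
        out[question] = block.bfill(axis=1).iloc[:, 0]
    return pd.DataFrame(out)

nan = np.nan
raw = pd.DataFrame(
    [
        ["A1-A", "A1-B", "A1-C", "A2-A", "A2-B", "A2-C"],  # option labels
        ["A1-A", nan, nan, nan, "A2-B", nan],
        [nan, "A1-B", nan, "A2-A", nan, nan],
        [nan, nan, "A1-C", nan, "A2-B", nan],
    ],
    columns=["Q1", "Unnamed: 2", "Unnamed: 3", "Q2", "Unnamed: 5", "Unnamed: 6"],
)

flat = collapse_questions(raw)
print(flat)
```

A respondent who skipped a question keeps NaN in that column, since there is nothing in the block for the back-fill to pull forward.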