Using pandas to extract the data I need

Asked: 2016-10-27 00:42:07

Tags: python excel pandas

I have an xlsx file that looks like this:

Name     01/09/16        02/09/16          03/09/16       
Jack        In            Out                 In          
Lisa        Out           In                  Out             
Tom         Out           In                  In  

I am trying to use pandas to print this data in the following table layout:

+--------+------------+------------+------------+
| Status |  01/09/16  |  02/09/16  |  03/09/16  |
+--------+------------+------------+------------+
| In     |  Jack      |  Tom       |  Tom       |
|        |            |  Lisa      |  Jack      |
+--------+------------+------------+------------+
| Out    |  Lisa      |  Jack      |  Lisa      |
|        |  Tom       |            |            |
+--------+------------+------------+------------+

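I am struggling to find a way to do this with pandas. Is there a simple way to iterate over the date columns, match each one against the rows, and get the cell values?

For example, starting with the first column, 01/09/16, how can I use pandas to walk down that column, find the cell value 'In', match it to the row name 'Jack', and then add it to a nested dictionary like this: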

dictionary = {'01/09/16': {'In': ['Jack'], 'Out': ['Lisa', 'Tom']}}

If I can do that, I can use something like PrettyTable to lay it out as a table, as shown in the second table above.

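2 answers:

Answer 0: (score: 3)

Consider a dictionary comprehension that runs over every column (Series) of the dataframe. But first, make sure Name is set as the dataframe's index: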

from io import StringIO
import pandas as pd

data = '''
Name     01/09/16        02/09/16          03/09/16       
Jack        In            Out                 In          
Lisa        Out           In                  Out             
Tom         Out           In                  In
'''
df = pd.read_table(StringIO(data), sep=r"\s+", index_col=0)   # raw string avoids an invalid-escape warning
print(df)

#      01/09/16 02/09/16 03/09/16
# Name                           
# Jack       In      Out       In
# Lisa      Out       In      Out
# Tom       Out       In       In

# BUILD DICTIONARY
dfdict = {col: (df[col][df[col] == 'In'].index.values,
                df[col][df[col] == 'Out'].index.values) for col in df.columns}
dfdict['Status'] = ['In', 'Out']

# CAST TO DATAFRAME 
finaldf = pd.DataFrame(dfdict)
finaldf = finaldf[['Status'] + [col for col in df.columns]]             # RE-ORDER COLS
print(finaldf)

#   Status     01/09/16     02/09/16     03/09/16
# 0     In       [Jack]  [Lisa, Tom]  [Jack, Tom]
# 1    Out  [Lisa, Tom]       [Jack]       [Lisa]
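
The same boolean-mask idea can also produce the nested dictionary described in the question. The following is only a minimal sketch: the names `nested` and `statuses` are illustrative assumptions, not part of the answer above, and it assumes the only cell values are 'In' and 'Out'.

```python
from io import StringIO
import pandas as pd

data = """Name 01/09/16 02/09/16 03/09/16
Jack In Out In
Lisa Out In Out
Tom Out In In
"""
# Name becomes the index, so the index holds the row labels to collect
df = pd.read_csv(StringIO(data), sep=r"\s+", index_col=0)

statuses = ("In", "Out")  # assumed to be the only values present

# For each date column, list the names whose cell equals each status
nested = {
    col: {status: df.index[df[col] == status].tolist() for status in statuses}
    for col in df.columns
}

print(nested["01/09/16"])
# {'In': ['Jack'], 'Out': ['Lisa', 'Tom']}
```

Using `.tolist()` gives plain Python lists, which are easier to pass on to a table printer than NumPy arrays.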

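Answer 1: (score: 2)

If I understand correctly (IIUC), and with Name kept as a regular column of df rather than the index: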

pd.melt(
    df, id_vars=['Name'], value_vars=df.columns[1:].tolist(),
    value_name='Status', var_name='Date'
).set_index(['Status', 'Date']).groupby(level=[0, 1]).Name.apply(list).unstack()


Or with less code:

df.set_index('Name').unstack().reset_index().groupby(['level_0', 0]) \
    .Name.apply(list).rename_axis([None, None]).unstack(0)
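
Neither snippet above shows how `df` is built, so here is a self-contained sketch of the melt-based variant. It assumes the sample data from the question is read with Name as an ordinary column; the names `long_df`, `table` and `nested_from_table` are illustrative, not from the answer.

```python
from io import StringIO
import pandas as pd

data = """Name 01/09/16 02/09/16 03/09/16
Jack In Out In
Lisa Out In Out
Tom Out In In
"""
# No index_col here: Name stays a regular column, as melt's id_vars needs
df = pd.read_csv(StringIO(data), sep=r"\s+")

# Wide -> long: one row per (Name, Date) pair with its Status
long_df = df.melt(id_vars="Name", var_name="Date", value_name="Status")

# Collect the names for each (Status, Date) pair, then pivot dates to columns
table = long_df.groupby(["Status", "Date"])["Name"].apply(list).unstack("Date")
print(table)

# The same result as a nested dict keyed by date, then by status
nested_from_table = table.to_dict()
```

Calling `to_dict()` on the unstacked frame yields a date-keyed dictionary of the shape the question asks for, so the table and the dictionary come from one computation.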
