Python: Splitting a string of digits into single-digit columns across a collection of dataframes

Date: 2017-10-05 13:35:18

Tags: python dictionary

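I have a collection of two dataframes in a Python dictionary. Each dataframe has a string column made up of 0s and 1s. The length of the string varies, because it equals the number of days in the month.

My problem is that I cannot figure out how to split the string column into several columns, so that each column holds only a one, a zero, or a missing value.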

I have seen a post suggesting that list(map(int, '01111001')) can be used to split a string of digits into individual digits.
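For reference, a minimal sketch of that per-character conversion on a single value. It assumes the value is stored as a string: an integer literal such as 01111001 is a syntax error in Python 3, and converting the text to an int first would drop the leading zero.

```python
holiday = "01111001"  # kept as a string so the leading zero survives

# Convert each character to an int
digits = [int(ch) for ch in holiday]
print(digits)

# The map-based form gives the same list
same = list(map(int, holiday))
print(same == digits)
```

This handles one string only; it does not pad shorter strings, which is the part the question is really about.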

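But how can I split the Holiday column in the dictionary below into many columns, so that each column contains only a one or a zero, or a missing value where a particular record is shorter?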

'ATM':
  Plant  Year  Month  Holiday
  01     1996  Mar    '01111001'
  02     1997  Feb    '0111011'
  SP     1996  Mar    '01100111'
  BE     1999  Mar    '00111111'

'FDA':
  Plant  Year  Month  Holiday
  01     2001  Mar    '01111101'
  02     2002  Mar    '11110110'
  SP     2001  Apr    '1110011'
  BE     2002  June   '10111100'

The result I want to achieve looks like this:

'ATM':
  Plant  Year  Month  H1 H2 H3 H4 H5 H6 H7 H8
  01     1996  Mar    0  1  1  1  1  0  0  1
  02     1997  Feb    0  1  1  1  0  1  1  NA
  SP     1996  Mar    0  1  1  0  0  1  1  1
  BE     1999  Mar    0  0  1  1  1  1  1  1

'FDA':
  Plant  Year  Month  H1 H2 H3 H4 H5 H6 H7 H8
  01     2001  Mar    0  1  1  1  1  1  0  1
  02     2002  Mar    1  1  1  1  0  1  1  0
  SP     2001  Apr    1  1  1  0  0  1  1  NA
  BE     2002  June   1  0  1  1  1  1  0  0

1 answer:

Answer 0 (score: 1)

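I wrote some test code that prints what you want. The idea is to use a NumPy string matrix to store the values. The matrix is pre-filled with "NA", so those entries remain wherever no digit is written. The trick is then to copy the values into the required positions with a slice assignment (NumPy broadcasting). The final dataframe is obtained by concatenating the new columns and deleting the column that is no longer needed. The code iterates over the keys of the dictionary. I assume you are using pandas dataframes and that the loaded binary values are read as objects (dtype=object), i.e. as strings.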
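The fill-then-copy idea can be tried in isolation before looking at the full code. This is a small sketch on two of the sample strings; the names strings and mat are only illustrative, and MAX_SIZE = 8 is assumed to be the longest string length, as in the sample data.

```python
import numpy as np

MAX_SIZE = 8  # assumed maximum string length
strings = ["01111001", "0111011"]  # two sample Holiday values

# Pre-fill every cell with the placeholder "NA"
mat = np.full((len(strings), MAX_SIZE), "NA")

for row, s in enumerate(strings):
    chars = np.array(list(s))
    # Overwrite only the first len(s) cells; the remaining cells keep "NA"
    mat[row, :len(chars)] = chars

print(mat)
```

Because the matrix holds strings, the digits come out as "0" and "1" rather than integers, and "NA" is a plain string rather than a real missing value.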

The first part of the code below only builds the dictionary of dataframes (the "header").

import pandas as pd
import numpy as np

## Let's call this part the "header"

from io import StringIO

df_0 = """
Plant;Year;Month;Holiday
01;1996;Mar;01111001
02;1997;Feb;0111011
SP;1996;Mar;01100111
BE;1999;Mar;00111111
"""

df_1 = """
Plant;Year;Month;Holiday
01;2001;Mar;01111101
02;2002;Mar;11110110
SP;2001;Apr;1110011
BE;2002;June;10111100
"""

df_0 = pd.read_csv(StringIO(df_0), sep=";", dtype=object)
df_1 = pd.read_csv(StringIO(df_1), sep=";", dtype=object)

df = { "ATM": df_0, "PDE": df_1 }

## "Header" end 

MAX_SIZE = 8

for k in df:
  ldf = df[k]
  rows = ldf.shape[0]  # number of rows (shape[1] would be the column count)

  # Matrix pre-filled with "NA"; cells not overwritten below stay "NA"
  nmat = np.full((rows, MAX_SIZE), "NA")

  for i in range(rows):
      # Split the string into single characters, the same
      # conversion I suggested in the comments
      ary = np.array(list(ldf["Holiday"].iloc[i]))
      # Copying only the needed part, in some cases the final
      # array is of len 7 instead of 8.
      nmat[i, 0:len(ary)] = ary

  # Creating a new dataframe that will be
  # concatenated by using the numpy array generated before.
  nframe = pd.DataFrame(nmat, 
             columns=["H" + str(i+1) for i in range(MAX_SIZE)])
  # Append the new H1..H8 columns to the original frame
  ldf = pd.concat([ldf, nframe], axis=1)
  # Drop the original "Holiday" column (skip this if you still need it)
  del ldf["Holiday"]
  # Store the result back in the dictionary
  df[k] = ldf

# et voilà
print(df)

This prints:

{
  'ATM':
      Plant  Year Month H1 H2 H3 H4 H5 H6 H7  H8
    0    01  1996   Mar  0  1  1  1  1  0  0   1
    1    02  1997   Feb  0  1  1  1  0  1  1  NA
    2    SP  1996   Mar  0  1  1  0  0  1  1   1
    3    BE  1999   Mar  0  0  1  1  1  1  1   1,
  'PDE':
      Plant  Year Month H1 H2 H3 H4 H5 H6 H7  H8
    0    01  2001   Mar  0  1  1  1  1  1  0   1
    1    02  2002   Mar  1  1  1  1  0  1  1   0
    2    SP  2001   Apr  1  1  1  0  0  1  1  NA
    3    BE  2002  June  1  0  1  1  1  1  0   0
}
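If you would rather have real missing values (NaN) and numeric columns instead of the strings "0", "1" and "NA", here is a sketch that uses only pandas. It is an alternative to the NumPy matrix above, not the same method; split_digits is a hypothetical helper name, and the small frame is just two rows of the sample data.

```python
from io import StringIO

import pandas as pd

raw = """Plant;Year;Month;Holiday
01;1996;Mar;01111001
02;1997;Feb;0111011
"""
frame = pd.read_csv(StringIO(raw), sep=";", dtype=object)


def split_digits(frame, column="Holiday", prefix="H"):
    """Hypothetical helper: expand a digit-string column into H1..Hn columns."""
    # One list of characters per row; shorter rows are padded with missing values
    digits = pd.DataFrame(frame[column].map(list).tolist(), index=frame.index)
    digits.columns = [f"{prefix}{i + 1}" for i in range(digits.shape[1])]
    # Convert "0"/"1" to numbers; the padding becomes NaN
    digits = digits.apply(pd.to_numeric)
    return pd.concat([frame.drop(columns=column), digits], axis=1)


result = split_digits(frame)
print(result)
```

For a dictionary of dataframes you could apply it with something like {k: split_digits(v) for k, v in df.items()}. Note that the number of H columns here follows the longest string in each frame rather than a fixed MAX_SIZE, so frames with different maximum lengths end up with different column counts.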