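I have a collection of two data frames in a Python dictionary. Each data frame has a string column made up of 0s and 1s. The strings vary in length, because the length is the number of days in that month.
My problem is that I can't figure out how to split the string column into multiple columns, so that each column holds only a one, a zero, or a missing value.
I have looked at this post, which suggests that one can use list(int(i) for i in '01111001')
to split a string of digits into individual numbers.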
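For a single value, that idea might look like the minimal sketch below. It assumes the value is kept as a string: an integer literal such as 01111001 is not valid in Python 3, and converting to a number would drop the leading zero. The variable names are only illustrative.

```python
# Hypothetical single value, kept as a string so the leading zero survives
holiday = "01111001"

# Convert each character of the string into an integer
digits = [int(ch) for ch in holiday]
print(digits)  # [0, 1, 1, 1, 1, 0, 0, 1]
```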
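However, how can I split the Holiday column in the dictionary below into many columns, so that each column contains only a one, a zero, or a missing value if a particular record is shorter?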
'ATM':
Plant Year Month Holiday
01 1996 Mar '01111001'
02 1997 Feb '0111011'
SP 1996 Mar '01100111'
BE 1999 Mar '00111111'
'FDA':
Plant Year Month Holiday
01 2001 Mar '01111101'
02 2002 Mar '11110110'
SP 2001 Apr '1110011'
BE 2002 June '10111100'
The result I want to achieve is the following:
'ATM':
Plant Year Month H1 H2 H3 H4 H5 H6 H7 H8
01 1996 Mar 0 1 1 1 1 0 0 1
02 1997 Feb 0 1 1 1 0 1 1 NA
SP 1996 Mar 0 1 1 0 0 1 1 1
BE 1999 Mar 0 0 1 1 1 1 1 1
'FDA':
Plant Year Month H1 H2 H3 H4 H5 H6 H7 H8
01 2001 Mar 0 1 1 1 1 1 0 1
02 2002 Mar 1 1 1 1 0 1 1 0
SP 2001 Apr 1 1 1 0 0 1 1 NA
BE 2002 June 1 0 1 1 1 1 0 0
Answer 0 (score: 1)
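I wrote a test script that prints what you want. The idea is to use a numpy str matrix to store the values. The matrix is pre-filled with "NA", so "NA" remains wherever a string is too short. The trick is then to use slice assignment to copy each row's values into the required positions. The new data frame is completed by concatenating the new columns and deleting the column that is no longer needed. The code iterates over the keys of the dictionary. I assume you are using pandas data frames and that the loaded binary values are interpreted as objects.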
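The fill-then-copy step can be sketched in isolation as follows. This is only an illustration with two hypothetical sample strings, not the full solution; `holidays` and the loop variables are names chosen for the example.

```python
import numpy as np

MAX_SIZE = 8
# Hypothetical sample values: one full-length string and one shorter string
holidays = ["01111001", "0111011"]

# Start with a string matrix in which every cell is "NA"
nmat = np.full((len(holidays), MAX_SIZE), "NA")

for i, s in enumerate(holidays):
    # Overwrite only the first len(s) cells of row i;
    # any remaining cells keep their "NA" placeholder
    nmat[i, :len(s)] = list(s)

print(nmat)
```

Because only the leading cells of each row are overwritten, the shorter string leaves "NA" in its last column.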
The first part of the code is a "header" that builds the dictionary of data frames.
import pandas as pd
import numpy as np
## Let's call it the "header"
from io import StringIO
df_0 = """
Plant;Year;Month;Holiday
01;1996;Mar;01111001
02;1997;Feb;0111011
SP;1996;Mar;01100111
BE;1999;Mar;00111111
"""
df_1 = """
Plant;Year;Month;Holiday
01;2001;Mar;01111101
02;2002;Mar;11110110
SP;2001;Apr;1110011
BE;2002;June;10111100
"""
# dtype=object keeps the Holiday values as strings, including leading zeros
df_0 = pd.read_csv(StringIO(df_0), sep=";", dtype=object)
df_1 = pd.read_csv(StringIO(df_1), sep=";", dtype=object)
df = {"ATM": df_0, "PDE": df_1}
## "Header" end
MAX_SIZE = 8
for k in df:
    ldf = df[k]
    # Number of rows is shape[0] (shape[1] would be the column count)
    rows = ldf.shape[0]
    # Here I create a matrix pre-filled with the required "NA" values
    nmat = np.full((rows, MAX_SIZE), "NA")
    for i in range(rows):
        # I'm using the same conversion that I suggested to you in
        # the comments
        ary = np.array([v for v in ldf["Holiday"].iloc[i]])
        # Copying only the needed part; in some cases the
        # array has length 7 instead of 8.
        nmat[i, 0:len(ary)] = ary
    # Creating a new dataframe from the numpy array generated before;
    # it will be concatenated to the original one.
    nframe = pd.DataFrame(nmat,
                          columns=["H" + str(i+1) for i in range(MAX_SIZE)])
    # Actual concatenation
    ldf = pd.concat([ldf, nframe], axis=1)
    # and deletion of the "Holiday" column
    del ldf["Holiday"]  # only if really needed, removes Holiday column
    # Substitution in the original dictionary
    df[k] = ldf
# et voilà
print(df)
which prints:
{
'ATM':
Plant Year Month H1 H2 H3 H4 H5 H6 H7 H8
0 01 1996 Mar 0 1 1 1 1 0 0 1
1 02 1997 Feb 0 1 1 1 0 1 1 NA
2 SP 1996 Mar 0 1 1 0 0 1 1 1
3 BE 1999 Mar 0 0 1 1 1 1 1 1,
'PDE':
Plant Year Month H1 H2 H3 H4 H5 H6 H7 H8
0 01 2001 Mar 0 1 1 1 1 1 0 1
1 02 2002 Mar 1 1 1 1 0 1 1 0
2 SP 2001 Apr 1 1 1 0 0 1 1 NA
3 BE 2002 June 1 0 1 1 1 1 0 0
}
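As a possible alternative, the same split can be sketched with pandas alone. This is not the approach used in the answer above: it relies on the fact that building a DataFrame from lists of unequal length pads the shorter rows with missing values, so they show up as real missing values rather than the string "NA". The sample frame and the names `frame`, `parts` and `result` are assumptions made for this example.

```python
import pandas as pd

# Hypothetical two-row sample with one shorter Holiday string
frame = pd.DataFrame({
    "Plant": ["01", "02"],
    "Holiday": ["01111001", "0111011"],
})

# Turn each string into a list of characters; a DataFrame built from
# lists of unequal length pads the shorter rows with missing values
parts = pd.DataFrame(frame["Holiday"].map(list).tolist(), index=frame.index)
parts.columns = [f"H{i + 1}" for i in range(parts.shape[1])]

# Replace the Holiday column with the new H1..Hn columns
result = frame.drop(columns="Holiday").join(parts)
print(result)
```

To apply this to a dictionary of data frames, the same three steps could be run once per key, as in the loop above.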