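I am trying to map codes to blocks. The blocks are defined by numeric ranges, e.g. AAA0-AAA9 would contain the codes AAA0, AAA1, AAA2, and so on. The ranges can vary, but they can be defined in, for example, a list. I would appreciate help converting the codes in a pandas DataFrame to their respective blocks.
See the example starting DataFrame: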
import pandas as pd
d = {'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'Code': [
'AAA1', 'AAA2', 'AAA3', 'AAA4', 'AAA5', 'CCC2', 'AAA7', 'AAA9', 'BBB1', 'BBB2']}
df = pd.DataFrame(data=d)
See the example desired DataFrame (using the blocks 'AAA0-9', 'CCC5-9', 'BBB0-5'):
d = {'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'Code': [
'AAA0-9', 'AAA0-9', 'AAA0-9', 'AAA0-9', 'AAA0-9', 'CCC5-9', 'AAA0-9', 'AAA0-9', 'BBB0-5', 'BBB0-5']}
df = pd.DataFrame(data=d)
Edit: additional codes. Same concept as above, but a row may now hold several codes, so multiple blocks may apply.
d = {'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'Code': ['AAA1 AAA2 AAA3', 'AAA2', 'AAA3 AAA9', 'AAA4', 'AAA5', 'CCC2 CCC3', 'AAA7',
'AAA9', 'BBB1', 'BBB2']}
df = pd.DataFrame(data=d)
Answer 0 (score: 5)
Use map with a dictionary, applied to the first 3 characters of each code, which are selected by indexing with str:
d = {'AAA':'AAA0-9', 'CCC':'CCC5-9', 'BBB':'BBB0-5'}
#or generate dict from list
#L = ['AAA0-9', 'CCC5-9', 'BBB0-5']
#d = {x[:3]:x for x in L}
df['Code'] = df['Code'].str[:3].map(d)
print (df)
Code ID
0 AAA0-9 1
1 AAA0-9 2
2 AAA0-9 3
3 AAA0-9 4
4 AAA0-9 5
5 CCC5-9 6
6 AAA0-9 7
7 AAA0-9 8
8 BBB0-5 9
9 BBB0-5 10
Detail:
print (df['Code'].str[:3])
0 AAA
1 AAA
2 AAA
3 AAA
4 AAA
5 CCC
6 AAA
7 AAA
8 BBB
9 BBB
Name: Code, dtype: object
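One thing worth knowing about the prefix lookup above: map returns NaN for any prefix that is missing from the dictionary. The following is a minimal sketch of one way to handle that, assuming that unmatched codes should simply be kept as they are. The 'ZZZ4' code and the Block column name are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical data: 'ZZZ4' has no matching block in the dictionary.
df = pd.DataFrame({'ID': [1, 2, 3], 'Code': ['AAA1', 'CCC2', 'ZZZ4']})
d = {'AAA': 'AAA0-9', 'CCC': 'CCC5-9', 'BBB': 'BBB0-5'}

# map yields NaN for prefixes that are not keys of the dictionary
mapped = df['Code'].str[:3].map(d)

# assumption of this sketch: fall back to the original code when no block matches
df['Block'] = mapped.fillna(df['Code'])
print(df)
```

Writing the result to a separate column keeps the original codes available for checking which rows were not matched.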
Edit:
If the values also need to be expanded into one row per code:
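import numpy as np
# split each cell on whitespace into a list of codes
a = df.Code.str.split()
# repeat each ID once per code found in its row
b = np.repeat(df.ID.values, a.str.len())
# flatten the lists of codes into a single array
c = np.concatenate(a.values)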
d = {'AAA':'AAA0-9', 'CCC':'CCC5-9', 'BBB':'BBB0-5'}
df = pd.DataFrame({'Code':c, 'ID':b})
print (df)
Code ID
0 AAA1 1
1 AAA2 1
2 AAA3 1
3 AAA2 2
4 AAA3 3
5 AAA9 3
6 AAA4 4
7 AAA5 5
8 CCC2 6
9 CCC3 6
10 AAA7 7
11 AAA9 8
12 BBB1 9
13 BBB2 10
df['Code'] = df['Code'].str[:3].map(d)
print (df)
Code ID
0 AAA0-9 1
1 AAA0-9 1
2 AAA0-9 1
3 AAA0-9 2
4 AAA0-9 3
5 AAA0-9 3
6 AAA0-9 4
7 AAA0-9 5
8 CCC5-9 6
9 CCC5-9 6
10 AAA0-9 7
11 AAA0-9 8
12 BBB0-5 9
13 BBB0-5 10
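The same expansion can probably be written more compactly with DataFrame.explode, which is available from pandas 0.25 onward. This is an alternative sketch rather than the approach used above, and the shortened sample data and the name out are made up for the illustration.

```python
import pandas as pd

# Shortened, hypothetical sample in the same shape as the question's data
df = pd.DataFrame({'ID': [1, 2, 3],
                   'Code': ['AAA1 AAA2 AAA3', 'CCC2 CCC3', 'BBB1']})
d = {'AAA': 'AAA0-9', 'CCC': 'CCC5-9', 'BBB': 'BBB0-5'}

# split each cell into a list, then give every list element its own row
out = (df.assign(Code=df['Code'].str.split())
         .explode('Code')
         .reset_index(drop=True))

# map each code to its block through the 3-character prefix
out['Code'] = out['Code'].str[:3].map(d)
print(out)
```

explode repeats the ID values automatically, so the separate np.repeat and np.concatenate steps are not needed in this variant.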
Alternatively, if the format should not change (one row per ID, with the blocks joined by spaces):
df = (df.set_index('ID')['Code']
.str.split(expand=True)
.stack()
.str[:3]
.map(d)
.groupby(level=0)
.apply(' '.join)
.reset_index(name='Code'))
print (df)
ID Code
0 1 AAA0-9 AAA0-9 AAA0-9
1 2 AAA0-9
2 3 AAA0-9 AAA0-9
3 4 AAA0-9
4 5 AAA0-9
5 6 CCC5-9 CCC5-9
6 7 AAA0-9
7 8 AAA0-9
8 9 BBB0-5
9 10 BBB0-5
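For small frames, a plain Python function with apply may be easier to read than the split/stack/groupby chain. The sketch below assumes the same 3-character prefix convention; the helper name to_blocks is hypothetical, and leaving unknown prefixes unchanged is an assumption of this sketch, not something the question specifies.

```python
import pandas as pd

# Shortened, hypothetical sample in the same shape as the question's data
df = pd.DataFrame({'ID': [1, 2, 3],
                   'Code': ['AAA1 AAA2 AAA3', 'CCC2 CCC3', 'BBB1']})
d = {'AAA': 'AAA0-9', 'CCC': 'CCC5-9', 'BBB': 'BBB0-5'}

def to_blocks(cell):
    # map every space-separated code through its 3-character prefix;
    # codes with an unknown prefix are left unchanged (assumption)
    return ' '.join(d.get(code[:3], code) for code in cell.split())

df['Code'] = df['Code'].apply(to_blocks)
print(df)
```

This trades vectorised string methods for a Python-level loop, so it is likely slower on large data, but the row order and index stay untouched.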
EDIT1:
If the dictionary needs to be generated from the ranges:
L = ['AAA0-9', 'CCC2-9', 'BBB0-5']
d = (pd.Series(L, index=L)
.str.extract(r'(?P<a>\D+)(?P<b>\d)-(?P<c>\d+)', expand=True)
.set_index('a', append=True)
.astype(int)
.apply(lambda x: pd.Series(range(x.b, x.c + 1)), axis=1)
.stack()
.astype(int)
.astype(str)
.reset_index(name='d')
.assign(a=lambda x: x.a + x.d)
.rename(columns={'level_0':'e'})
.set_index('a')['e']
.to_dict()
)
print (d)
{'BBB1': 'BBB0-5', 'CCC6': 'CCC2-9', 'CCC2': 'CCC2-9',
'BBB4': 'BBB0-5', 'CCC5': 'CCC2-9', 'BBB2': 'BBB0-5',
'CCC4': 'CCC2-9', 'AAA4': 'AAA0-9', 'BBB0': 'BBB0-5',
'AAA9': 'AAA0-9', 'BBB3': 'BBB0-5', 'CCC3': 'CCC2-9',
'AAA0': 'AAA0-9', 'AAA3': 'AAA0-9', 'CCC9': 'CCC2-9',
'AAA2': 'AAA0-9', 'BBB5': 'BBB0-5', 'AAA1': 'AAA0-9',
'CCC8': 'CCC2-9', 'CCC7': 'CCC2-9', 'AAA8': 'AAA0-9',
'AAA7': 'AAA0-9', 'AAA5': 'AAA0-9', 'AAA6': 'AAA0-9'}
df['Code'] = df['Code'].map(d)
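The same code-to-block dictionary can also be built without pandas. The following is a plain-Python sketch of that idea, assuming every block has the form prefix, lower bound, hyphen, upper bound; the function name expand_blocks is hypothetical.

```python
import re

def expand_blocks(blocks):
    """Return a dict mapping every code in each range to its block label."""
    mapping = {}
    for block in blocks:
        # assumed block format: non-digit prefix, lower bound, '-', upper bound
        m = re.fullmatch(r'(\D+)(\d+)-(\d+)', block)
        if m is None:
            raise ValueError(f'unexpected block format: {block!r}')
        prefix, lo, hi = m.group(1), int(m.group(2)), int(m.group(3))
        # the upper bound is inclusive, as in 'AAA0-9'
        for n in range(lo, hi + 1):
            mapping[f'{prefix}{n}'] = block
    return mapping

L = ['AAA0-9', 'CCC2-9', 'BBB0-5']
d = expand_blocks(L)
print(len(d))
```

Because the keys are full codes, a code that falls outside every range (for example a hypothetical 'CCC1') is absent from the dictionary and would become NaN under df['Code'].map(d).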
Answer 1 (score: 1)
An easy way to accommodate codes of arbitrary length:
df.Code.str.extract(r'(\D+)', expand=False)
0 AAA
1 AAA
2 AAA
3 AAA
4 AAA
5 CCC
6 AAA
7 AAA
8 BBB
9 BBB
Name: Code, dtype: object
You can even name the columns conveniently:
df.Code.str.extract(r'(?P<Block>\D+)(?P<Num>\d+)', expand=True)
Block Num
0 AAA 1
1 AAA 2
2 AAA 3
3 AAA 4
4 AAA 5
5 CCC 2
6 AAA 7
7 AAA 9
8 BBB 1
9 BBB 2
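To get from the extracted prefix to the block label, the extraction can be combined with map. This is a minimal sketch, assuming each block is identified by its leading non-digit prefix; the sample codes, the block list and the Block column name are hypothetical and chosen to show prefixes of different lengths.

```python
import re
import pandas as pd

# Hypothetical codes and blocks with prefixes of different lengths
df = pd.DataFrame({'ID': [1, 2, 3], 'Code': ['AAAA1', 'CC22', 'BBB1']})
L = ['AAAA0-9', 'CC0-99', 'BBB0-5']

# key each block by its leading run of non-digit characters
d = {re.match(r'\D+', block).group(): block for block in L}

# extract the same kind of prefix from every code and look up its block
df['Block'] = df['Code'].str.extract(r'(\D+)', expand=False).map(d)
print(df)
```

Unlike slicing with str[:3], this does not depend on the prefix having a fixed width.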