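I want to split strings into equal-length pieces without using a delimiter, and expand the data frame accordingly.
This is the test data frame I am working with: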
sample1 = pd.DataFrame({
'TST': {1: 1535840000000, 2: 1535840000000},
'RCV': {1: 1535840000000, 2: 1535850000000},
'TCU': {1: 358272000000000, 2: 358272000000000},
'SPD': {1: '0', 2: '00000000000000710000007D007C00E2'}
})
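As you can see, the SPD column contains strings of varying length with no delimiter.
I want to split the SPD column into new rows every 4 characters and expand the rest of the data frame to match: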
TST RCV TCU SPD
0 1535840000000 1535840000000 358272000000000 0000
1 1535840000000 1535840000000 358272000000000 0000
2 1535840000000 1535840000000 358272000000000 0000
3 1535840000000 1535840000000 358272000000000 0071
4 1535840000000 1535840000000 358272000000000 0000
5 1535840000000 1535840000000 358272000000000 007D
6 1535840000000 1535840000000 358272000000000 007C
7 1535840000000 1535840000000 358272000000000 00E2
I first tried to build a Series with the following:
pd.concat([pd.Series(re.findall('....', row['SPD'])) for _, row in sample1.iterrows()]).reset_index()
which gives:
index 0
0 0 0000
1 1 0000
2 2 0000
3 3 0071
4 4 0000
5 5 007D
6 6 007C
7 7 00E2
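But I could not work out how to expand this back onto sample1.
Answer 0 (score: 3)
You can use str.findall, then repeat each row according to the number of 4-character slices found in its SPD value.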
from itertools import chain
df = sample1.copy()  # work on a copy, because pop removes SPD from the frame
spd4 = df.pop('SPD').str.findall(r'.{4}')
(pd.DataFrame(df.values.repeat(spd4.str.len(), axis=0), columns=df.columns)
   .assign(SPD=list(chain.from_iterable(spd4))))
TST RCV TCU SPD
0 1535840000000 1535850000000 358272000000000 0000
1 1535840000000 1535850000000 358272000000000 0000
2 1535840000000 1535850000000 358272000000000 0000
3 1535840000000 1535850000000 358272000000000 0071
4 1535840000000 1535850000000 358272000000000 0000
5 1535840000000 1535850000000 358272000000000 007D
6 1535840000000 1535850000000 358272000000000 007C
7 1535840000000 1535850000000 358272000000000 00E2
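For reference, here is a self-contained variant of the same idea, written against sample1 from the question. It is only a sketch: it repeats index labels through loc instead of repeating df.values, a swap made here on the assumption that you want the numeric columns to keep their integer dtype. The names chunks and expanded are illustrative and not part of the original answer.

```python
from itertools import chain

import pandas as pd

sample1 = pd.DataFrame({
    'TST': {1: 1535840000000, 2: 1535840000000},
    'RCV': {1: 1535840000000, 2: 1535850000000},
    'TCU': {1: 358272000000000, 2: 358272000000000},
    'SPD': {1: '0', 2: '00000000000000710000007D007C00E2'},
})

# One list of 4-character chunks per row; a value shorter than 4 characters yields [].
chunks = sample1['SPD'].str.findall(r'.{4}')

# Repeat each index label once per chunk, then fill SPD with the flattened chunks.
expanded = (
    sample1.drop(columns='SPD')
    .loc[sample1.index.repeat(chunks.str.len())]
    .assign(SPD=list(chain.from_iterable(chunks)))
    .reset_index(drop=True)
)
print(expanded)
```

Note that the row whose SPD is '0' disappears here, just as in the output above, because it contributes zero chunks.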
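Answer 1 (score: 2)
You can use str.findall to split the strings in SPD every 4 characters, then use the unnesting function from the linked solution to unnest the result into the data frame: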
sample1['SPD'] = sample1.SPD.str.ljust(4, '0').str.findall(r'.{4}?')
unnesting(sample1, ['SPD'])
SPD TST RCV TCU
1 0000 1535840000000 1535840000000 358272000000000
2 0000 1535840000000 1535850000000 358272000000000
2 0000 1535840000000 1535850000000 358272000000000
2 0000 1535840000000 1535850000000 358272000000000
2 0071 1535840000000 1535850000000 358272000000000
2 0000 1535840000000 1535850000000 358272000000000
2 007D 1535840000000 1535850000000 358272000000000
2 007C 1535840000000 1535850000000 358272000000000
2 00E2 1535840000000 1535850000000 358272000000000
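The unnesting helper comes from the linked solution and is not reproduced here. If you are on pandas 0.25 or later, the built-in DataFrame.explode can probably stand in for it. A minimal sketch under that assumption, with split and exploded as illustrative names:

```python
import pandas as pd

sample1 = pd.DataFrame({
    'TST': {1: 1535840000000, 2: 1535840000000},
    'RCV': {1: 1535840000000, 2: 1535850000000},
    'TCU': {1: 358272000000000, 2: 358272000000000},
    'SPD': {1: '0', 2: '00000000000000710000007D007C00E2'},
})

# Right-pad short values with '0' up to 4 characters so they survive the split,
# then cut every value into a list of 4-character chunks.
split = sample1.assign(SPD=sample1['SPD'].str.ljust(4, '0').str.findall(r'.{4}'))

# explode gives each list element its own row and repeats the other columns.
exploded = split.explode('SPD')
print(exploded)
```

As in the output above, the original index labels are kept, so label 2 repeats once per chunk; chain .reset_index(drop=True) if you want a fresh 0..n-1 index.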
Answer 2 (score: 1)
Use Series.str.extractall, then join the result back to the original df.
sample1.filter(regex='^(?!SPD)').join(
sample1.SPD.str.extractall('(?P<SPD>.{4})').reset_index(level=1, drop=True)
)
# TST RCV TCU SPD
#1 1535840000000 1535840000000 358272000000000 NaN
#2 1535840000000 1535850000000 358272000000000 0000
#2 1535840000000 1535850000000 358272000000000 0000
#2 1535840000000 1535850000000 358272000000000 0000
#2 1535840000000 1535850000000 358272000000000 0071
#2 1535840000000 1535850000000 358272000000000 0000
#2 1535840000000 1535850000000 358272000000000 007D
#2 1535840000000 1535850000000 358272000000000 007C
#2 1535840000000 1535850000000 358272000000000 00E2
If you want to exclude rows whose SPD has fewer than 4 characters, use an inner join (... how='inner').
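A sketch of that inner-join variant, spelled out in full against sample1 from the question. The names pieces and result are illustrative, and drop(columns='SPD') is used here in place of the filter regex purely for readability.

```python
import pandas as pd

sample1 = pd.DataFrame({
    'TST': {1: 1535840000000, 2: 1535840000000},
    'RCV': {1: 1535840000000, 2: 1535850000000},
    'TCU': {1: 358272000000000, 2: 358272000000000},
    'SPD': {1: '0', 2: '00000000000000710000007D007C00E2'},
})

# One row per 4-character match. extractall adds a 'match' index level,
# which is dropped so only the original row label remains.
pieces = (
    sample1['SPD']
    .str.extractall(r'(?P<SPD>.{4})')
    .reset_index(level=1, drop=True)
)

# how='inner' keeps only the rows that produced at least one chunk,
# so the row with SPD == '0' is left out instead of showing NaN.
result = sample1.drop(columns='SPD').join(pieces, how='inner')
print(result)
```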