Suppose I have a dataframe that looks like this:
REFERENCE_CODE
dog
1
2
3
4
cat
1
2

4
5

rat

3
4
5

fish
4
5
6
Note the blank rows. I would like to get a dataframe that looks like this:
REFERENCE_CODE
dog
dog_1
dog_2
dog_3
dog_4
cat
cat_1
cat_2

cat_4
cat_5

rat

rat_3
rat_4
rat_5

fish
fish_4
fish_5
fish_6
I tried something like the following:
for index, row in df.iterrows():
    if isinstance(row['REFERENCE_CODE'], str):
        # great! continue
        continue
    elif isinstance(row['REFERENCE_CODE'], int):
        # go back up and find the last instance, concatenate
        pass
    else:
        pass
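I can't fill in the parts that are pseudocode. Is my logic right? Is there a simpler way to do this? Ideally I would like to preserve the integrity of the original data (blank rows, size, and so on), but if that isn't possible, that's fine too. I'll find a workaround! Thanks.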
Following Andy Hayden's answer, I get:
Question number REFERENCE_CODE ... Unnamed: 12 Unnamed: 13
0 Q1a ladder_now ... NaN NaN
1 NaN NaN ... NaN NaN
2 NaN 1 ... NaN NaN
3 NaN 2 ... NaN NaN
4 NaN 3 ... NaN NaN
[5 rows x 14 columns]
Traceback (most recent call last):
File "/Users/mitchell_bregman/Projects/trend_env/src/script4.py", line 14, in <module>
headers = (df.REFERENCE_CODE != '') & ~df.REFERENCE_CODE.str.isnumeric()
File "/Users/mitchell_bregman/Projects/trend_env/lib/python3.7/site-packages/pandas/core/generic.py", line 1466, in __invert__
arr = operator.inv(com.values_from_object(self))
TypeError: bad operand type for unary ~: 'float'
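A note on the traceback: it is consistent with REFERENCE_CODE containing NaN, as the printed head shows. On such a column `.str.isnumeric()` returns NaN for the missing rows, so the result is not a clean boolean Series and `~` fails on the float NaN. Below is a minimal sketch of one way around this, assuming the blank cells are NaN. The small frame is made up for illustration and stands in for the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: NaN marks the blank rows.
df = pd.DataFrame({"REFERENCE_CODE": ["dog", "1", "2", np.nan, "cat", "4", np.nan]})

# Turn every cell into a plain string first, so the .str methods
# return real booleans instead of NaN for the missing rows.
codes = df["REFERENCE_CODE"].fillna("").astype(str)

is_numeric = codes.str.isnumeric()
headers = (codes != "") & ~is_numeric  # True only on rows like 'dog' and 'cat'

print(headers.tolist())
```

The same `codes` Series can then be used in place of `df.REFERENCE_CODE` in the mask. If the numbers were read as floats, `astype(str)` would give strings such as '1.0', which `isnumeric()` rejects, so that case would need separate handling.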
Answer 0 (score: 1)
To get the groups, you can use a mask and a cumulative sum:
In [11]: headers = (df.REFERENCE_CODE != '') & ~df.REFERENCE_CODE.str.isnumeric()
In [12]: headers.cumsum()
Out[12]:
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 2
9 2
10 2
11 2
12 3
13 3
14 3
15 3
16 3
17 3
18 4
19 4
20 4
21 4
Name: REFERENCE_CODE, dtype: int64
Now you can use this to group:
In [13]: res = df.groupby(headers.cumsum())['REFERENCE_CODE'].apply(lambda x: x.iloc[0] + '_' + x)
In [14]: res
Out[14]:
0 dog_dog
1 dog_1
2 dog_2
3 dog_3
4 dog_4
5 cat_cat
6 cat_1
7 cat_2
8 cat_
9 cat_4
10 cat_5
11 cat_
12 rat_rat
13 rat_
14 rat_3
15 rat_4
16 rat_5
17 rat_
18 fish_fish
19 fish_4
20 fish_5
21 fish_6
Name: REFERENCE_CODE, dtype: object
Then update the column, using only the relevant (numeric) rows:
In [15]: df.REFERENCE_CODE.update(res[df.REFERENCE_CODE.str.isnumeric()])
In [16]: df
Out[16]:
REFERENCE_CODE
0 dog
1 dog_1
2 dog_2
3 dog_3
4 dog_4
5 cat
6 cat_1
7 cat_2
8
9 cat_4
10 cat_5
11
12 rat
13
14 rat_3
15 rat_4
16 rat_5
17
18 fish
19 fish_4
20 fish_5
21 fish_6
It might be easier to transform this a different way... I think this is a strange goal (it would be a bit easier in plain Python).
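One possible alternative transformation, offered as a sketch and not as part of the answer above, is to forward-fill the header values instead of grouping. It assumes blank rows are empty strings and numeric rows are digit strings. The sample frame is a shortened, made-up version of the question's data.

```python
import pandas as pd

# Hypothetical sample mirroring the question; '' stands for a blank row.
df = pd.DataFrame(
    {"REFERENCE_CODE": ["dog", "1", "2", "", "cat", "4", "", "rat", "", "3"]}
)

codes = df["REFERENCE_CODE"].astype(str)
is_numeric = codes.str.isnumeric()
is_header = (codes != "") & ~is_numeric

# Keep only the header values, then carry each one down to the rows below it.
last_header = codes.where(is_header).ffill()

# Prefix the numeric rows; headers and blanks are left untouched.
df["REFERENCE_CODE"] = codes.where(~is_numeric, last_header + "_" + codes)

print(df["REFERENCE_CODE"].tolist())
# ['dog', 'dog_1', 'dog_2', '', 'cat', 'cat_4', '', 'rat', '', 'rat_3']
```

This avoids both `groupby` and the separate `update` step, at the cost of building one extra Series for the forward-filled headers.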
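Answer 1 (score: 0)
What you can do is apply a function along the Series, using a mutable default argument of the function as a "cache". I am assuming that what you have is the following list of values: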
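ls = ['dog', 1, 2, 3, 4, 'cat', 1, 2, '', 4, 5,
      'rat', '', 3, 4, 5, '', 'fish', 4, 5, 6]

def append_string(x, last_string_value=['initial_string']):
    # The mutable default list persists between calls and acts as the cache.
    if isinstance(x, str) or x is None:
        if x:
            # Non-empty string: remember it as the current prefix.
            last_string_value[0] = x
        return x
    else:
        # Number: prepend the most recent string seen.
        return last_string_value[0] + '_{}'.format(x)

print(list(map(append_string, ls)))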
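This gives you the result you want. If what you have is a dataframe, you can apply this function to the corresponding Series and you will get the same effect.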
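A sketch of that last step, assuming a column that holds the same mix of strings and integers as the list above. The names `make_prefixer` and `prefix` are hypothetical. A closure is used here in place of the mutable default argument so that the remembered prefix is not shared between separate runs. The approach relies on `Series.map` visiting the values in order.

```python
import pandas as pd

def make_prefixer():
    """Return a function that remembers the last non-empty string it saw."""
    last_header = None  # state lives in this closure, one per prefixer

    def prefix(value):
        nonlocal last_header
        if isinstance(value, str):
            if value:  # non-empty string: a new header such as 'dog'
                last_header = value
            return value  # headers and blanks pass through unchanged
        # anything else is treated as a number and gets the current prefix
        return "{}_{}".format(last_header, value)

    return prefix

# Hypothetical mixed-type column; '' stands for a blank row.
df = pd.DataFrame({"REFERENCE_CODE": ["dog", 1, 2, "", "cat", 4, 5]})
df["REFERENCE_CODE"] = df["REFERENCE_CODE"].map(make_prefixer())

print(df["REFERENCE_CODE"].tolist())
# ['dog', 'dog_1', 'dog_2', '', 'cat', 'cat_4', 'cat_5']
```

Calling `make_prefixer()` again gives a fresh function with no remembered header, which is the main difference from the default-argument version.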