My data is in a column (Series) of a pandas DataFrame. Each entry is a string of comma-separated values, for example:
workid:1234, homeid:4567, schoolid: 6789
The string may be empty or contain several values, and the ID numbers may vary in length:
id_numbers
0
1 workid:1234
2 workid:2567, homeid:345, schoolid: 678
3 homeid:567, schoolid: 6789
I want to create a new Series, "workid", that contains only the workid:xxxx values:
work_id_numbers
0
1 workid:1234
2 workid:2567
3
I tried
df['id_list'] = df['id_numbers'].str.split(",")
to create lists:
id_list
0
1 [workid:1234]
2 [workid:2567, homeid:345, schoolid: 678]
3 [homeid:567, schoolid: 6789]
Then I tried iterating over the lists to extract the workid:xxx value:
for num in df['id_list']:
    if num.str.contains("workid", na=False) == True:
        df['work_id_number'] = num
But I get an error:
AttributeError: 'float' object has no attribute 'str'
I suspect there are several ways to solve this, so I am open to corrections to my approach or to a different approach altogether.
Answer 0 (score: 2)
df['id_list'] = df['id_numbers'].str.extract(r"(workid[^,]*)", expand=False).fillna("")
# output
id_numbers id_list
0 workid:1234 workid:1234
1 workid:2567, homeid:345, schoolid: 678 workid:2567
2 homeid:567, schoolid: 6789
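For anyone who wants to try the regex approach end to end, here is a minimal self-contained sketch. The sample frame is rebuilt from the question's data (including the empty first row as NaN), and the column name `work_id_numbers` is only borrowed from the desired output in the question; adjust both to your real data.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question; row 0 is a missing value
df = pd.DataFrame({
    "id_numbers": [
        np.nan,
        "workid:1234",
        "workid:2567, homeid:345, schoolid: 678",
        "homeid:567, schoolid: 6789",
    ]
})

# Capture "workid" plus everything up to the next comma.
# expand=False returns a Series for a single capture group;
# rows without a match (or with NaN input) become NaN, then "".
df["work_id_numbers"] = (
    df["id_numbers"].str.extract(r"(workid[^,]*)", expand=False).fillna("")
)

print(df["work_id_numbers"].tolist())
```

Because `.str.extract` skips missing values by itself, no explicit NaN check is needed before calling it.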
Answer 1 (score: 1)
A solution using a list comprehension:
df['id_list'] = [','.join(y for y in x.split(", ") if y.startswith('workid'))
                 for x in df['id_numbers'].fillna('')]
print(df)
id_numbers id_list
0 NaN
1 workid:1234 workid:1234
2 workid:2567, homeid:345, schoolid: 678 workid:2567
3 homeid:567, schoolid: 6789
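As for the error in the question: iterating over a column yields plain Python objects, and the missing row is a float NaN, which has no `.str` attribute (`.str` is an accessor on a Series, not on its elements). The loop also assigns `num` to the whole column on each pass. Below is a rough sketch of a per-row version that keeps the spirit of the original loop; `pick_workid` is a hypothetical helper name, not something from pandas.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question; row 0 is a missing value
df = pd.DataFrame({
    "id_numbers": [
        np.nan,
        "workid:1234",
        "workid:2567, homeid:345, schoolid: 678",
        "homeid:567, schoolid: 6789",
    ]
})


def pick_workid(value):
    """Return the first 'workid...' token of a comma-separated string, or ''."""
    # Missing values are not strings, so skip them instead of calling
    # string methods on them
    if not isinstance(value, str):
        return ""
    for token in value.split(","):
        token = token.strip()  # drop the space that follows each comma
        if token.startswith("workid"):
            return token
    return ""


# apply() calls the helper once per element and builds a new Series
df["work_id_number"] = df["id_numbers"].apply(pick_workid)

print(df["work_id_number"].tolist())
```

This is slower than the vectorized `.str.extract` call on large frames, but it makes the handling of missing rows explicit.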