For example, I have a DataFrame like this:
name eventlist
0 a [{'t': '1234', 'n': 'user_engagem1'},{'t': '2345', 'n': 'user_engagem2'},{'t': '3456', 'n': 'user_engagem3'}]
1 b [{'t': '2345', 'n': 'user_engagem4'},{'t': '1345', 'n': 'user_engagem5'},{'t': '1356', 'n': 'user_engagem6'},{'t': '1345', 'n': 'user_engagem5'},{'t': '1359', 'n': 'user_engagem6'}]
2 c [{'t': '1334', 'n': 'user_engagem3'},{'t': '2345', 'n': 'user_engagem4'},{'t': '3556', 'n': 'user_engagem2'}]
I tried re.findall on a plain string and it seems to work, giving ['1234','2345','3456'], but I can't apply it to the DataFrame.
# code 1: applying it to a string works
str="[{'t': '1234', 'n': 'user_engagem'},{'t': '2345', 'n': 'user_engagem'},{'t': '3456', 'n': 'user_engagem'}]"
print(re.findall(r"t': '(.+?)', '", str))
# code 2: applying it to the DataFrame doesn't work
df['t']=df['events'].str.findall(r"t': '(.+?)', '", df['events'])
print(list)
I would like to get a result like this:
name eventlist
0 a ['1234', '2345', '3456']
1 b ['2345', '1345', '1356', '1345', '1359']
2 c ['1334', '2345', '3556']
Or, even better, a result like this:
name t_first t_last
0 a 1234 3456
1 b 2345 1359
2 c 1334 3556
Answer 0 (score: 1)
You can use ast.literal_eval to convert each string into a list of dictionaries, and then get the values by the key t:
import ast
out = []
for x in df.pop('eventlist'):
    a = ast.literal_eval(x)
    out.append([a[0].get('t'), a[-1].get('t')])
Or use re.findall:
import re
out = []
for x in df.pop('eventlist'):
    a = re.findall(r"t': '(.+?)', '", x)
    out.append([a[0], a[-1]])
print (out)
[['1234', '3456'], ['2345', '1359'], ['1334', '3556']]
Then create a DataFrame from the result and join it to the original:
df = df.join(pd.DataFrame(out, columns=['t_first','t_last'], index=df.index))
print (df)
name t_first t_last
0 a 1234 3456
1 b 2345 1359
2 c 1334 3556
Another option is str.findall combined with indexing through str:
a = df.pop('eventlist').str.findall(r"t': '(.+?)'")
df = df.assign(t_first= a.str[0], t_last = a.str[-1])
Answer 1 (score: 1)
str.findall takes a single argument: the regex pattern.
# Call `pop` here to remove the "eventlist" column.
v = df.pop('eventlist').str.findall(r"t': '(.+?)'")
print(v)
0 [1234, 2345, 3456]
1 [2345, 1345, 1356, 1345, 1359]
2 [1334, 2345, 3556]
Name: eventlist, dtype: object
You can then load the results into separate columns:
# More efficient than assigning if done in-place.
df['t_first'] = v.str[0]
df['t_last'] = v.str[-1]
# Or, if you want to return a copy,
# df = df.assign(t_first=v.str[0], t_last=v.str[-1])
df
name t_first t_last
0 a 1234 3456
1 b 2345 1359
2 c 1334 3556
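According to the pandas documentation, Series.str.findall is equivalent to applying re.findall to every element of the Series. The sketch below reproduces that per-row behavior with plain Python on two of the question's rows, which can help when debugging the pattern outside pandas. The variable names are illustrative only.

```python
import re

pattern = r"t': '(.+?)'"

# Two rows of the question's eventlist column, as plain strings
eventlist = [
    "[{'t': '1234', 'n': 'user_engagem1'},{'t': '2345', 'n': 'user_engagem2'},"
    "{'t': '3456', 'n': 'user_engagem3'}]",
    "[{'t': '1334', 'n': 'user_engagem3'},{'t': '2345', 'n': 'user_engagem4'},"
    "{'t': '3556', 'n': 'user_engagem2'}]",
]

# One re.findall call per row, like Series.str.findall does per element
matches = [re.findall(pattern, s) for s in eventlist]
t_first = [m[0] for m in matches]   # counterpart of v.str[0]
t_last = [m[-1] for m in matches]   # counterpart of v.str[-1]
print(matches)
print(t_first, t_last)
```

The full lists in matches correspond to the first result the question asks for; the first and last items correspond to the t_first and t_last columns.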
Another, better option is to precompile your pattern with re.compile and run it in a loop, extracting the first and last items from the findall result.
import re
p = re.compile(r"t': '(.+?)'")
out = []
for name, string in zip(df.name, df.pop('eventlist')):
    a = p.findall(string)
    out.append([name, a[0], a[-1]])
pd.DataFrame(out, columns=['name', 't_first','t_last'], index=df.index)
name t_first t_last
0 a 1234 3456
1 b 2345 1359
2 c 1334 3556
If you need to convert them to int, replace out.append([name, a[0], a[-1]]) with out.append([name, int(a[0]), int(a[-1])]).
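The solutions above assume you will always have more than one match. If a row may contain only one match, or none at all, you can modify the solution to check the number of matches before appending: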
import numpy as np
p = re.compile(r"t': '(.+?)'")
out = []
for name, string in zip(df.name, df.pop('eventlist')):
    # Default to NaN when a value cannot be extracted
    first = second = np.nan
    if pd.notna(string):
        a = p.findall(string)
        if len(a) > 0:
            first = int(a[0])
            second = int(a[-1]) if len(a) > 1 else second
    out.append([name, first, second])
pd.DataFrame(out, columns=['name', 't_first','t_last'], index=df.index)
name t_first t_last
0 a 1234 3456
1 b 2345 1359
2 c 1334 3556
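The sample data never exercises the edge cases this version guards against, so here is a small standalone sketch that does. It keeps the same rule as the loop above (the last value stays NaN when there is only one match). The helper name first_last_or_nan is hypothetical, and math.nan stands in for np.nan so the example needs only the standard library.

```python
import math
import re

PATTERN = re.compile(r"t': '(.+?)'")

def first_last_or_nan(text):
    """Return (first, last) 't' values as ints, with NaN where unavailable."""
    first = last = math.nan
    if isinstance(text, str):  # skips missing values such as None or float NaN
        found = PATTERN.findall(text)
        if found:
            first = int(found[0])
        if len(found) > 1:
            last = int(found[-1])
    return first, last

print(first_last_or_nan("[{'t': '1', 'n': 'a'},{'t': '2', 'n': 'b'}]"))  # two matches
print(first_last_or_nan("[{'t': '1234', 'n': 'x'}]"))                    # one match
print(first_last_or_nan("[]"))                                           # no match
print(first_last_or_nan(None))                                           # missing value
```

Whether a single match should also fill the last value is a design choice; the sketch simply follows the answer's code.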
Answer 2 (score: 1)
# Assumes each eventlist value is already a list of dicts, not a string
df['eventlist'] = df['eventlist'].map(lambda x:[i['t'] for i in x])
df
name eventlist
0 a [1234, 2345, 3456]
1 b [2345, 1345, 1356, 1345, 1359]
2 c [1334, 2345, 3556]
# Take the first and last element of each row's list
df['t_first'] = df['eventlist'].map(lambda x: x[0])
df['t_last'] = df['eventlist'].map(lambda x: x[-1])
df = df[['name','t_first','t_last']]
df
name t_first t_last
0 a 1234 3456
1 b 2345 1359
2 c 1334 3556
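The comprehension in this answer indexes each element with i['t'], which works only when eventlist holds real lists of dicts. If the column holds strings, as in the question, iterating over a value yields single characters and i['t'] raises a TypeError. A minimal sketch that accepts both forms, assuming the strings are valid Python literals (t_values is a hypothetical helper name):

```python
import ast

def t_values(events):
    """Return every 't' value from a list of dicts or from its string form."""
    if isinstance(events, str):
        events = ast.literal_eval(events)  # parse the stringified list first
    return [event['t'] for event in events]

as_string = "[{'t': '1334', 'n': 'user_engagem3'},{'t': '2345', 'n': 'user_engagem4'}]"
as_list = [{'t': '1334', 'n': 'user_engagem3'}, {'t': '2345', 'n': 'user_engagem4'}]
print(t_values(as_string))
print(t_values(as_list))
```

It could replace the lambda in the first map call, for example df['eventlist'].map(t_values).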