I am using `apply()` to construct a Series of tuples from the values of an existing DataFrame. I need the values in each tuple to follow a specific order, and I need to replace NaN with `'{}'` in every column except one.
The following functions produce the desired result, but they run very slowly:
def build_insert_tuples_series(row):
    # Here I attempt to handle ordering the final tuple
    # I must also replace NaN with "{}" for all but v2 column.
    vals = [row['v2']]
    row_sans_v2 = row.drop(labels=['v2'])
    row_sans_v2.fillna("{}", inplace=True)
    res = [val for val in row_sans_v2]
    vals += res
    return tuple(vals)

def generate_insert_values_series(df):
    df['insert_vals'] = df.apply(lambda x: build_insert_tuples_series(x), axis=1)
    return df['insert_vals']
Original DataFrame:
    id   v1    v2
0  1.0  foo  quux
1  2.0  bar   foo
2  NaN  NaN   baz
DataFrame obtained when calling `generate_insert_values_series(df)`:
    id   v1    v2       insert_vals
0  1.0  foo  quux  (quux, 1.0, foo)
1  2.0  bar   foo   (foo, 2.0, bar)
2  NaN  NaN   baz     (baz, {}, {})
The ordering logic for the final tuple is `(v2, ..all_other_columns..)`.
I timed the function that generates the resulting DataFrame. I think there may be a more efficient way to build the Series, but I am not sure how to optimise the operation with vectorisation or another approach.
Answer 0 (score: 3)
`zip`, `get`, `mask`, `fillna`, and `sorted`
A one-liner, for what it's worth:
df.assign(
insert_vals=
[*zip(*map(df.mask(df.isna(), {}).get, sorted(df, key=lambda x: x != 'v2')))])
    id   v1    v2       insert_vals
0  1.0  foo  quux  (quux, 1.0, foo)
1  2.0  bar   foo   (foo, 2.0, bar)
2  NaN  NaN   baz     (baz, {}, {})
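To make the `zip(*map(get, cols))` part of the one-liner easier to follow, here is a minimal plain-Python sketch of the same transposition. The dict `table` is a hypothetical stand-in for the already-filled DataFrame (it uses the string `'{}'` from the question as the placeholder), and `table.get` plays the role of `DataFrame.get`.

```python
# Plain-Python stand-in for the filled DataFrame: column label -> column values.
# (Hypothetical data mirroring the question's example.)
table = {
    "id": [1.0, 2.0, "{}"],
    "v1": ["foo", "bar", "{}"],
    "v2": ["quux", "foo", "baz"],
}
cols = ["v2", "id", "v1"]        # desired order of values inside each tuple

columns = map(table.get, cols)   # one sequence per column, in `cols` order
rows = [*zip(*columns)]          # zip(*...) transposes columns into row tuples
print(rows)
```

Each element of `rows` is one row of the table as a tuple, with `v2` first, which is the shape the question asks for.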
Less of a mouthful:
get = df.mask(df.isna(), {}).get
key = lambda x: x != 'v2'
cols = sorted(df, key=key)
df.assign(insert_vals=[*zip(*map(get, cols))])
    id   v1    v2       insert_vals
0  1.0  foo  quux  (quux, 1.0, foo)
1  2.0  bar   foo   (foo, 2.0, bar)
2  NaN  NaN   baz     (baz, {}, {})
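The `sorted(df, key=lambda x: x != 'v2')` call works because iterating a DataFrame yields its column labels, `False` sorts before `True`, and Python's sort is stable. A small sketch on plain lists (the list `wide` is a made-up example, not from the question):

```python
columns = ["id", "v1", "v2"]  # column labels from the question's DataFrame

# name != "v2" is False for "v2" and True for everything else; False sorts first.
ordered = sorted(columns, key=lambda name: name != "v2")
print(ordered)

# Stability keeps the remaining labels in their original relative order.
wide = ["b", "v2", "a", "c"]  # hypothetical labels, deliberately not alphabetical
ordered_wide = sorted(wide, key=lambda name: name != "v2")
print(ordered_wide)
```

So the key function only moves `v2` to the front and leaves every other column where it was relative to its neighbours.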
This should work for older versions of Python:
get = df.mask(df.isna(), {}).get
key = lambda x: x != 'v2'
cols = sorted(df, key=key)
df.assign(insert_vals=zip(*map(get, cols)))
Answer 1 (score: 2)
First, you can use `numpy` to replace the null values with the `'{}'` placeholder:
import pandas as pd
import numpy as np
temp = pd.DataFrame({'id':[1,2, None], 'v1':['foo', 'bar', None], 'v2':['quux', 'foo', 'bar']})
def replace_na(col):
    return np.where(temp[col].isnull(), '{}', temp[col])

def generate_tuple(df):
    df['id'], df['v1'] = replace_na('id'), replace_na('v1')
    return df.apply(lambda x: tuple([x['v2'], x['id'], x['v1']]), axis=1)
The timing you get is:
%%timeit
temp['insert_tuple'] = generate_tuple(temp)
1000 loops, best of 3: 1 ms per loop
If you change the `return` in `generate_tuple` to something like this:
def generate_tuple(df):
    df['id'], df['v1'] = replace_na('id'), replace_na('v1')
    return list(zip(df['v2'], df['id'], df['v1']))
the timing becomes:
%%timeit
temp['insert_tuple'] = generate_tuple(temp)
1000 loops, best of 3: 674 µs per loop
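Note that `generate_tuple` above overwrites the `id` and `v1` columns of the frame it is given. As a rough sketch of the same zip-over-columns idea without that side effect, something like the following could be used. The function name `build_insert_tuples` and its `first` and `placeholder` parameters are hypothetical, and the cast to `object` is my own assumption to keep the string placeholder next to numeric values; no timing is claimed for it.

```python
import pandas as pd

def build_insert_tuples(df, first="v2", placeholder="{}"):
    """Return one tuple per row, `first` column leading, NaN replaced elsewhere."""
    others = [c for c in df.columns if c != first]
    # Work on an object-dtype copy so the caller's DataFrame is left untouched
    # and the string placeholder can sit alongside numeric values.
    filled = df[others].astype(object)
    filled = filled.where(df[others].notna(), placeholder)
    # zip over whole columns instead of calling a Python function per row.
    return list(zip(df[first], *(filled[c] for c in others)))

df = pd.DataFrame({
    "id": [1.0, 2.0, None],
    "v1": ["foo", "bar", None],
    "v2": ["quux", "foo", "baz"],
})
tuples = build_insert_tuples(df)
print(tuples)
```

The result can then be assigned with `df["insert_vals"] = tuples` if the column is wanted on the frame.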
Answer 2 (score: 2)
You shouldn't want to do this, because your new series will lose all vectorised functionality.
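But if you must, you can avoid `apply` by using `pd.DataFrame.itertuples`, a list comprehension, or `map`. The only complication is reordering the columns, which you can do by converting them to a `list`: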
df = pd.concat([df]*10000, ignore_index=True)
col_lst = df.columns.tolist()
cols = [col_lst.pop(col_lst.index('v2'))] + col_lst
%timeit list(df[cols].itertuples(index=False)) # 31.3 ms per loop
%timeit [tuple(x) for x in df[cols].values] # 74 ms per loop
%timeit list(map(tuple, df[cols].values)) # 73 ms per loop
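The benchmarks above were run on Python 3.6.0, but even on 2.7 you will likely find these more efficient than `apply`. Note that on 2.7 the final version does not need the `list` conversion, since `map` returns a `list` there.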
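The column reordering and the `map` behaviour mentioned above can be checked without pandas. In this small sketch the list `col_lst` stands in for `df.columns.tolist()` and `rows` holds hypothetical row values:

```python
col_lst = ["id", "v1", "v2"]  # stands in for df.columns.tolist()

# pop() removes "v2" from col_lst and returns it; the remaining labels keep
# their order, so concatenating puts "v2" in front.
cols = [col_lst.pop(col_lst.index("v2"))] + col_lst
print(cols)

rows = [["quux", 1.0, "foo"], ["foo", 2.0, "bar"]]  # hypothetical row values
lazy = map(tuple, rows)   # Python 3: an iterator, no tuples built yet
tuples = list(lazy)       # list() materialises them
print(tuples)
```

On Python 3 the `list(...)` wrapper is what actually produces the tuples, which is why the benchmark line includes it.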
If absolutely necessary, you can `fillna` using a series:
s = pd.Series([{} for _ in range(len(df.index))], index=df.index)
for col in df[cols]:
    # fillna returns a new Series, so assign the result back to the column
    df[col] = df[col].fillna(s)