Efficiently creating a Series of tuples from a pandas DataFrame

Time: 2018-10-17 15:04:18

Tags: python python-2.7 pandas

I am using apply() to construct a Series of tuples from the values of an existing DataFrame. I need the values in each tuple in a specific order, and I need to replace NaN with '{}' in all but one column.

The following functions produce the desired result, but execution is very slow when I time the function that generates the resulting DataFrame:

def build_insert_tuples_series(row):
    # Put v2 first, then the remaining columns in their existing order.
    # NaN is replaced with "{}" in every column except v2.
    vals = [row['v2']]
    row_sans_v2 = row.drop(labels=['v2'])
    row_sans_v2.fillna("{}", inplace=True)
    res = [val for val in row_sans_v2]
    vals += res
    return tuple(vals)

def generate_insert_values_series(df):
    df['insert_vals'] = df.apply(lambda x: build_insert_tuples_series(x), axis=1)
    return df['insert_vals']

Original DataFrame:

    id   v1    v2
0  1.0  foo  quux
1  2.0  bar   foo
2  NaN  NaN   baz

DataFrame obtained by calling generate_insert_values_series(df). The ordering logic for the final tuple is (v2, ..all_other_columns..):

    id   v1    v2       insert_vals
0  1.0  foo  quux  (quux, 1.0, foo)
1  2.0  bar   foo   (foo, 2.0, bar)
2  NaN  NaN   baz     (baz, {}, {})

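I think there may be a more efficient way to build the Series, but I am not sure how to optimize the operation with vectorization or some other approach.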

3 answers:

Answer 0 (score: 3)

zip, get, mask, fillna, and sorted

A one-liner, for what it's worth

df.assign(
    insert_vals=
    [*zip(*map(df.mask(df.isna(), {}).get, sorted(df, key=lambda x: x != 'v2')))])

    id   v1    v2       insert_vals
0  1.0  foo  quux  (quux, 1.0, foo)
1  2.0  bar   foo   (foo, 2.0, bar)
2  NaN  NaN   baz     (baz, {}, {})

Less of a one-liner

get = df.mask(df.isna(), {}).get
key = lambda x: x != 'v2'
cols = sorted(df, key=key)

df.assign(insert_vals=[*zip(*map(get, cols))])

    id   v1    v2       insert_vals
0  1.0  foo  quux  (quux, 1.0, foo)
1  2.0  bar   foo   (foo, 2.0, bar)
2  NaN  NaN   baz     (baz, {}, {})

This should work on legacy Python

get = df.mask(df.isna(), {}).get
key = lambda x: x != 'v2'
cols = sorted(df, key=key)

df.assign(insert_vals=zip(*map(get, cols)))
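The column ordering and the transpose are the two pieces doing most of the work above, so here is a minimal sketch of just those steps. The sample frame is a hypothetical copy of the one in the question, and the NaN replacement is deliberately left out.

```python
import pandas as pd

# Hypothetical sample frame mirroring the one in the question.
df = pd.DataFrame(
    {'id': [1.0, 2.0, None],
     'v1': ['foo', 'bar', None],
     'v2': ['quux', 'foo', 'baz']},
    columns=['id', 'v1', 'v2'])

# Iterating a DataFrame yields its column labels. The key is False for 'v2'
# and True for everything else; False sorts first, and the stable sort keeps
# the remaining columns in their original order.
cols = sorted(df, key=lambda x: x != 'v2')

# df.get(col) returns one column; zip(*columns) turns columns into row tuples.
rows = list(zip(*map(df.get, cols)))

print(cols)
print(rows[0])
```

Because sorted is stable, this only moves v2 to the front and never shuffles the other columns.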

Answer 1 (score: 2)

First, you can use numpy to replace the null values with the '{}' placeholder

import pandas as pd
import numpy as np

temp = pd.DataFrame({'id':[1,2, None], 'v1':['foo', 'bar', None], 'v2':['quux', 'foo', 'bar']})

def replace_na(col): 
    return np.where(temp[col].isnull(), '{}', temp[col])

def generate_tuple(df):
    df['id'], df['v1'] = replace_na('id'), replace_na('v1')
    return df.apply(lambda x: tuple([x['v2'], x['id'], x['v1']]), axis=1)

The timing you get is

%%timeit
temp['insert_tuple'] = generate_tuple(temp)
>>>> 1000 loops, best of 3 : 1ms per loop

If you change what generate_tuple returns to something like

def generate_tuple(df):
    df['id'], df['v1'] = replace_na('id'), replace_na('v1')
    return list(zip(df['v2'], df['id'], df['v1']))

the timing becomes:

%%timeit
temp['insert_tuple'] = generate_tuple(temp)
1000 loops, best of 3 : 674 µs per loop
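The same zip idea can be written without hard-coding the column names, which may be handy if the frame has more columns. This is only a sketch under assumptions: insert_tuples, first, and placeholder are names made up for illustration, and the sample frame copies the one in the question.

```python
import pandas as pd

def insert_tuples(df, first='v2', placeholder='{}'):
    """Return row tuples ordered (first, ...all other columns...).

    Hypothetical helper: NaN is replaced by `placeholder` in every
    column except `first`, which is left untouched.
    """
    others = [c for c in df.columns if c != first]
    # Cast to object so a string placeholder can sit next to numbers.
    filled = df[others].astype(object).fillna(placeholder)
    return list(zip(df[first], *(filled[c] for c in others)))

# Hypothetical sample frame mirroring the one in the question.
df = pd.DataFrame(
    {'id': [1.0, 2.0, None],
     'v1': ['foo', 'bar', None],
     'v2': ['quux', 'foo', 'baz']},
    columns=['id', 'v1', 'v2'])

result = insert_tuples(df)
print(result)
```

Unlike generate_tuple, this version does not overwrite the id and v1 columns of the frame it is given.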

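Answer 2 (score: 2)

You should not want to do this, because your new series will lose all vectorised functionality.

But if you must, you can avoid apply by using pd.DataFrame.itertuples, a list comprehension, or map. The only complication is reordering the columns, which you can do via conversion to list: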

df = pd.concat([df]*10000, ignore_index=True)

col_lst = df.columns.tolist()
cols = [col_lst.pop(col_lst.index('v2'))] + col_lst

%timeit list(df[cols].itertuples(index=False))  # 31.3 ms per loop
%timeit [tuple(x) for x in df[cols].values]     # 74 ms per loop
%timeit list(map(tuple, df[cols].values))       # 73 ms per loop

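The benchmarks above were run on Python 3.6.0, but even on 2.7 you will probably find these more efficient than apply. Note that the final version does not need the list conversion on 2.7, because map returns a list there.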
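To make the reordering step concrete, here is a small sketch on a hypothetical three-row copy of the question's frame. It passes name=None to itertuples so that plain tuples come back instead of namedtuples; that argument is my addition and is not part of the benchmark above.

```python
import pandas as pd

# Hypothetical sample frame mirroring the one in the question.
df = pd.DataFrame(
    {'id': [1.0, 2.0, None],
     'v1': ['foo', 'bar', None],
     'v2': ['quux', 'foo', 'baz']},
    columns=['id', 'v1', 'v2'])

col_lst = df.columns.tolist()
# pop() removes 'v2' from col_lst and returns it, so the right-hand operand
# holds only the remaining columns by the time the two lists are joined.
cols = [col_lst.pop(col_lst.index('v2'))] + col_lst

# name=None asks itertuples for regular tuples rather than namedtuples.
rows = list(df[cols].itertuples(index=False, name=None))

print(cols)
print(rows[0])
```

Missing values are not replaced in this sketch, so the last row still carries them.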

If you absolutely must, you can fillna via a series:

s = pd.Series([{} for _ in range(len(df.index))], index=df.index)

# fillna returns a new Series, so assign the result back to each column.
for col in df[cols]:
    df[col] = df[col].fillna(s)