Question

I am using apply() to construct a Series of tuples from the values of an existing DataFrame. I need the values in each tuple to follow a specific order, and I need to replace NaN with '{}' in all columns but one.

The following functions produce the desired result, but they run slowly:

def build_insert_tuples_series(row):
    # Put v2 first, then the remaining columns in their original order.
    # NaN is replaced with "{}" in every column except v2.
    vals = [row['v2']]
    row_sans_v2 = row.drop(labels=['v2']).fillna("{}")
    vals += list(row_sans_v2)
    return tuple(vals)

def generate_insert_values_series(df):
    df['insert_vals'] = df.apply(build_insert_tuples_series, axis=1)
    return df['insert_vals']

Original DataFrame:

    id   v1    v2
0  1.0  foo  quux
1  2.0  bar   foo
2  NaN  NaN   baz

DataFrame obtained after calling generate_insert_values_series(df):

    id   v1    v2       insert_vals
0  1.0  foo  quux  (quux, 1.0, foo)
1  2.0  bar   foo   (foo, 2.0, bar)
2  NaN  NaN   baz     (baz, {}, {})

The ordering logic for the final tuple is (v2, ..all_other_columns..).

Timing the function that generates the resulting DataFrame:

[timing output not preserved]

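I suspect there is a more efficient way to build the Series, but I am not sure how to optimize the operation with vectorization or some other approach.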

Answer 1

`zip`, `get`, `mask`, `fillna`, and `sorted`

A one-liner, for what it's worth:

df.assign(
    insert_vals=
    [*zip(*map(df.mask(df.isna(), {}).get, sorted(df, key=lambda x: x != 'v2')))])

    id   v1    v2       insert_vals
0  1.0  foo  quux  (quux, 1.0, foo)
1  2.0  bar   foo   (foo, 2.0, bar)
2  NaN  NaN   baz     (baz, {}, {})

The same thing, broken into less of a mouthful:

get = df.mask(df.isna(), {}).get
key = lambda x: x != 'v2'
cols = sorted(df, key=key)

df.assign(insert_vals=[*zip(*map(get, cols))])

    id   v1    v2       insert_vals
0  1.0  foo  quux  (quux, 1.0, foo)
1  2.0  bar   foo   (foo, 2.0, bar)
2  NaN  NaN   baz     (baz, {}, {})

This should work on legacy Python:

get = df.mask(df.isna(), {}).get
key = lambda x: x != 'v2'
cols = sorted(df, key=key)

df.assign(insert_vals=zip(*map(get, cols)))
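
The column-ordering trick above relies on two facts: iterating a DataFrame yields its column labels, and `sorted` is stable, so a key that is False only for 'v2' moves that column to the front and leaves the rest in their original order. Below is a minimal, self-contained sketch of the same idea. It assumes the DataFrame from the question (rebuilt here by hand) and swaps `mask` for a per-column `fillna("{}")`, so that NaN becomes the string '{}' as the question asks. The names `ordered` and `filled_columns` are illustrative only.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "id": [1.0, 2.0, np.nan],
    "v1": ["foo", "bar", np.nan],
    "v2": ["quux", "foo", "baz"],
})

# The key is False for 'v2' and True for every other label. False sorts
# before True, and sorted() is stable, so the result is ['v2', 'id', 'v1'].
ordered = sorted(df, key=lambda name: name != "v2")

# Replace NaN with the string '{}' in every column except v2. Casting to
# object first keeps the non-missing values unchanged.
filled_columns = [
    df[name] if name == "v2" else df[name].astype(object).fillna("{}")
    for name in ordered
]

# zip walks the columns in parallel and yields one tuple per row.
insert_vals = list(zip(*filled_columns))
print(insert_vals)
```

Building the tuples with `zip` over whole columns avoids creating one Series per row, which is where most of the `apply(axis=1)` overhead comes from.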

Answer 2

First, you can use NumPy to replace the null values with the '{}' placeholder:

import pandas as pd
import numpy as np

temp = pd.DataFrame({'id':[1,2, None], 'v1':['foo', 'bar', None], 'v2':['quux', 'foo', 'bar']})

def replace_na(col): 
    return np.where(temp[col].isnull(), '{}', temp[col])

def generate_tuple(df):
    df['id'], df['v1'] = replace_na('id'), replace_na('v1')
    return df.apply(lambda x: tuple([x['v2'], x['id'], x['v1']]), axis=1)

The timing you get is:

%%timeit
temp['insert_tuple'] = generate_tuple(temp)
>>>> 1000 loops, best of 3 : 1ms per loop

If you change the return statement of generate_tuple to something like this:

def generate_tuple(df):
    df['id'], df['v1'] = replace_na('id'), replace_na('v1')
    return list(zip(df['v2'], df['id'], df['v1']))

the timing becomes:

%%timeit
temp['insert_tuple'] = generate_tuple(temp)
1000 loops, best of 3 : 674 µs per loop

Answer 3

You shouldn't want to do this, because your new Series will lose all vectorised functionality.
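
As a small illustration of that point, the sketch below uses two made-up sample tuples. It shows that a Series of tuples is stored with object dtype, so element-wise work falls back to Python-level calls such as `map`.

```python
import pandas as pd

# Hypothetical sample values, shaped like the tuples in the question.
s = pd.Series([("quux", 1.0, "foo"), ("foo", 2.0, "bar")])

# Tuples cannot be held in a numeric block, so pandas stores them as objects.
dtype_is_object = s.dtype == object

# Extracting a field means calling a Python function once per element.
first_items = s.map(lambda t: t[0]).tolist()

print(dtype_is_object, first_items)
```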

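But if you must, you can avoid apply by using pd.DataFrame.itertuples, a list comprehension, or map. The only complication is reordering the columns, which you can do by converting them to a list: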

df = pd.concat([df]*10000, ignore_index=True)

col_lst = df.columns.tolist()
cols = [col_lst.pop(col_lst.index('v2'))] + col_lst

%timeit list(df[cols].itertuples(index=False))  # 31.3 ms per loop
%timeit [tuple(x) for x in df[cols].values]     # 74 ms per loop
%timeit list(map(tuple, df[cols].values))       # 73 ms per loop

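The benchmarks above were run on Python 3.6.0, but even on 2.7 you will probably find them more efficient than apply. Note that the final version does not need the list conversion on 2.7, because map returns a list there.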
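
Outside IPython, where `%timeit` is not available, a comparison along these lines can be sketched with the standard-library `timeit` module. The sample frame below is an assumption: it has no NaN, so the four results can be compared for equality. The dictionary `candidates` and the repeat counts are illustrative choices. No timings are quoted because they depend on the machine and the pandas version.

```python
import timeit

import pandas as pd

# Hypothetical NaN-free sample, repeated to give the timings some weight.
df = pd.DataFrame({
    "id": [1.0, 2.0, 3.0],
    "v1": ["foo", "bar", "qux"],
    "v2": ["quux", "foo", "baz"],
})
df = pd.concat([df] * 1000, ignore_index=True)

# Move v2 to the front and keep the other columns in their original order.
cols = ["v2"] + [c for c in df.columns if c != "v2"]

candidates = {
    # name=None makes itertuples yield plain tuples instead of namedtuples.
    "itertuples": lambda: list(df[cols].itertuples(index=False, name=None)),
    "list comprehension": lambda: [tuple(x) for x in df[cols].values],
    "map": lambda: list(map(tuple, df[cols].values)),
    "apply": lambda: list(df[cols].apply(tuple, axis=1)),
}

# Check that every variant builds the same tuples before timing them.
results = {name: fn() for name, fn in candidates.items()}

# Best of three repeats, three calls per repeat.
timings = {
    name: min(timeit.repeat(fn, number=3, repeat=3))
    for name, fn in candidates.items()
}

for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name:>20}: {seconds:.4f} s for 3 calls")
```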

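If absolutely necessary, you can fillna with a Series:

s = pd.Series([{} for _ in range(len(df.index))], index=df.index)

# Fill every column except v2 (the first entry in cols). fillna returns a new
# Series, so the result has to be assigned back.
for col in cols[1:]:
    df[col] = df[col].fillna(s)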

Efficiently create a Series of tuples from a pandas DataFrame
