熊猫:遍历每一行,提取功能并创建新列

时间:2017-08-01 21:52:24

标签: python python-2.7 pandas dataframe

我有一个数据框,其中包含一个名称不同的列。我从这些名称中提取特征并将它们存储到字典中。然后我想为每个功能创建一个列,并为每个名称存储值。我正在努力让我的循环正确。

我的代码:

import pandas as pd

data = pd.DataFrame(['Mike', 'Ester', 'Sarah'])
data.columns = ['name']

def get_features(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    return features

for name in data['name']:
    features = get_features(name)
    print features
    for f,v in features.items():
        data[f] = v
data.head()

我明白了:

name    lastletter  firstletter
0   Mike    h   s
1   Ester   h   s
2   Sarah   h   s

我需要:

name    lastletter  firstletter
0   Mike    e   m
1   Ester   r   e
2   Sarah   h   s

我理解为什么所有名字都从姓氏中获取值,但我无法弄清楚如何修复它。我可能首先为所有功能创建新的标题,然后更新我的数据框,但我希望有一个更聪明的方法。非常感谢您的帮助!

编辑:我的功能比第一个/最后一个字母复杂得多。它包含大约20种不同的功能,所以我真的需要建立一个字典......

def get_features(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    features["hythen"] = ("-" in name.lower())
    features["suffix"] = name[-2:].lower()
    features["prefix"] = name[0:2].lower()
    features["length"] = len(name)
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features

2 个答案:

答案 0 :(得分:3)

我这样做:

In [107]: data[['first_letter','last_letter']] = \
              data.name.str.lower().str.extract(r'^(.).*(.)$', expand=True)

In [108]: data
Out[108]:
    name first_letter last_letter
0   Mike            m           e
1  Ester            e           r
2  Sarah            s           h

<强>更新

In [127]: df.join(pd.DataFrame.from_records(df.apply(lambda x: get_features(x['name']),
                                                     axis=1).values, 
                                            index=df.index))
Out[127]:
    name  count(a)  count(b)  count(c)  count(d)  count(e)  count(f)  \
0   Mike         0         0         0         0         1         0
1  Ester         0         0         0         0         2         0
2  Sarah         2         0         0         0         0         0

   count(g)  count(h)  count(i)   ...    has(v)  has(w)  has(x)  has(y)  \
0         0         0         1   ...     False   False   False   False
1         0         0         0   ...     False   False   False   False
2         0         1         0   ...     False   False   False   False

   has(z)  hythen  lastletter  length  prefix  suffix
0   False   False           e       4      mi      ke
1   False   False           r       5      es      er
2   False   False           h       5      sa      ah

[3 rows x 59 columns]

答案 1 :(得分:2)

新答案

更改您的功能以返回pd.Series并仅lower执行一次。

def get_features(name):
    features = {}
    name = name.lower()
    features["firstletter"] = name[0]
    features["lastletter"] = name[-1]
    features["hythen"] = ("-" in name)
    features["suffix"] = name[-2:]
    features["prefix"] = name[0:2]
    features["length"] = len(name)
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.count(letter)
        features["has(%s)" % letter] = (letter in name)
    return pd.Series(features)

然后使用apply

data.join(data.name.apply(get_features))

    name  count(a)  count(b)  count(c)  count(d)  count(e)  count(f)  count(g)  count(h)  count(i)   ...    has(v)  has(w)  has(x)  has(y)  has(z)  hythen  lastletter  length  prefix  suffix
0   Mike         0         0         0         0         1         0         0         0         1   ...     False   False   False   False   False   False           e       4      mi      ke
1  Ester         0         0         0         0         2         0         0         0         0   ...     False   False   False   False   False   False           r       5      es      er
2  Sarah         2         0         0         0         0         0         0         1         0   ...     False   False   False   False   False   False           h       5      sa      ah

旧答案

data.assign(
    **data.name.str.lower().str.extract(
        '^(?P<firstletter>.).*(?P<lastletter>.)$', expand=True
    )
)

    name firstletter lastletter
0   Mike           m          e
1  Ester           e          r
2  Sarah           s          h