我有一个数据框,其中包含一个名称不同的列。我从这些名称中提取特征并将它们存储到字典中。然后我想为每个功能创建一个列,并为每个名称存储值。我正在努力让我的循环正确。
我的代码:
import pandas as pd
data = pd.DataFrame(['Mike', 'Ester', 'Sarah'])
data.columns = ['name']
def get_features(name):
features = {}
features["firstletter"] = name[0].lower()
features["lastletter"] = name[-1].lower()
return features
for name in data['name']:
features = get_features(name)
print features
for f,v in features.items():
data[f] = v
data.head()
我明白了:
name lastletter firstletter
0 Mike h s
1 Ester h s
2 Sarah h s
我需要:
name lastletter firstletter
0 Mike e m
1 Ester r e
2 Sarah h s
我理解为什么所有名字都从姓氏中获取值,但我无法弄清楚如何修复它。我可能首先为所有功能创建新的标题,然后更新我的数据框,但我希望有一个更聪明的方法。非常感谢您的帮助!
编辑:我的功能比第一个/最后一个字母复杂得多。它包含大约20种不同的功能,所以我真的需要建立一个字典......
def get_features(name):
features = {}
features["firstletter"] = name[0].lower()
features["lastletter"] = name[-1].lower()
features["hythen"] = ("-" in name.lower())
features["suffix"] = name[-2:].lower()
features["prefix"] = name[0:2].lower()
features["length"] = len(name)
for letter in 'abcdefghijklmnopqrstuvwxyz':
features["count(%s)" % letter] = name.lower().count(letter)
features["has(%s)" % letter] = (letter in name.lower())
return features
答案 0 :(得分:3)
我这样做:
In [107]: data[['first_letter','last_letter']] = \
data.name.str.lower().str.extract(r'^(.).*(.)$', expand=True)
In [108]: data
Out[108]:
name first_letter last_letter
0 Mike m e
1 Ester e r
2 Sarah s h
<强>更新强>
In [127]: df.join(pd.DataFrame.from_records(df.apply(lambda x: get_features(x['name']),
axis=1).values,
index=df.index))
Out[127]:
name count(a) count(b) count(c) count(d) count(e) count(f) \
0 Mike 0 0 0 0 1 0
1 Ester 0 0 0 0 2 0
2 Sarah 2 0 0 0 0 0
count(g) count(h) count(i) ... has(v) has(w) has(x) has(y) \
0 0 0 1 ... False False False False
1 0 0 0 ... False False False False
2 0 1 0 ... False False False False
has(z) hythen lastletter length prefix suffix
0 False False e 4 mi ke
1 False False r 5 es er
2 False False h 5 sa ah
[3 rows x 59 columns]
答案 1 :(得分:2)
新答案
更改您的功能以返回pd.Series
并仅lower
执行一次。
def get_features(name):
features = {}
name = name.lower()
features["firstletter"] = name[0]
features["lastletter"] = name[-1]
features["hythen"] = ("-" in name)
features["suffix"] = name[-2:]
features["prefix"] = name[0:2]
features["length"] = len(name)
for letter in 'abcdefghijklmnopqrstuvwxyz':
features["count(%s)" % letter] = name.count(letter)
features["has(%s)" % letter] = (letter in name)
return pd.Series(features)
然后使用apply
data.join(data.name.apply(get_features))
name count(a) count(b) count(c) count(d) count(e) count(f) count(g) count(h) count(i) ... has(v) has(w) has(x) has(y) has(z) hythen lastletter length prefix suffix
0 Mike 0 0 0 0 1 0 0 0 1 ... False False False False False False e 4 mi ke
1 Ester 0 0 0 0 2 0 0 0 0 ... False False False False False False r 5 es er
2 Sarah 2 0 0 0 0 0 0 1 0 ... False False False False False False h 5 sa ah
旧答案
data.assign(
**data.name.str.lower().str.extract(
'^(?P<firstletter>.).*(?P<lastletter>.)$', expand=True
)
)
name firstletter lastletter
0 Mike m e
1 Ester e r
2 Sarah s h