I am formatting a DataFrame for scikit-learn PCA. It looks like this:
datetime | mood | activities | notes
8/27/2017 | "good" | ["friends", "party", "gaming"] | NaN
8/28/2017 | "meh" | ["work", "friends", "good food"] | "Stuff stuff"
8/29/2017 | "bad" | ["work", "travel"] | "Fell off my bike"
...and so on
I want to convert it to this, which I think will work better for ML:
datetime | mood | friends | party | gaming | work | good food | travel | notes
8/27/2017 | "good" | True | True | True | False | False | False | NaN
8/28/2017 | "meh" | True | False | False | True | True | False | "Stuff stuff"
8/29/2017 | "bad" | False | False | False | True | False | True | "Fell off my bike"
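I have tried the approach outlined here, which only gave me a left-justified matrix of all the activities. The columns were meaningless. If I try to pass columns to the DataFrame constructor, I get an error: "26 columns passed, passed data had 9 columns". I believe this is because, even though I have 26 discrete activities, the most I have ever done in a single day is 9. Is there a way to fill in 0 / False when a column is not found in a particular row?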
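For reference, the behavior described above can be reproduced with a small sketch. The three-row `activities` list and the six-name `all_activities` list are stand-ins assumed here for the real data (26 activities, at most 9 per day).

```python
import pandas as pd

# Hypothetical stand-in for the real activities column
activities = [
    ["friends", "party", "gaming"],
    ["work", "friends", "good food"],
    ["work", "travel"],
]
all_activities = ["friends", "party", "gaming", "work", "good food", "travel"]

# Building a frame straight from the lists yields positional columns 0, 1, 2:
# each row is left-justified and short rows are padded with missing values.
left_justified = pd.DataFrame(activities)
print(left_justified)

# Passing the full list of names fails because the data is only 3 columns wide.
try:
    pd.DataFrame(activities, columns=all_activities)
    error_message = None
except (ValueError, AssertionError) as exc:
    error_message = str(exc)
print(error_message)
```

The column position only records the order in which an activity was listed that day, which is why the resulting columns carry no meaning.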
Answer 0 (score: 2)
Here is a complete solution, parsing the messy output and all:
from ast import literal_eval
from io import StringIO
import numpy as np
import pandas as pd
# the raw data
d = '''datetime | mood | activities | notes
8/27/2017 | "good" | ["friends", "party", "gaming"] | NaN
8/28/2017 | "meh" | ["work", "friends", "good food"] | "Stuff stuff"
8/29/2017 | "bad" | ["work", "travel"] | "Fell off my bike"'''
# parse the raw data (io.StringIO replaces pd.compat.StringIO, which newer pandas no longer provides)
df = pd.read_csv(StringIO(d), sep=r'\s*\|\s*', engine='python')
# parse the lists of activities (which are still strings)
acts = df['activities'].apply(literal_eval)
# get the unique activities
actcols = np.unique([a for al in acts for a in al])
# assemble the desired one hot array from the activities
# (np.isin is the current spelling of np.in1d)
actarr = np.array([np.isin(actcols, al) for al in acts])
actdf = pd.DataFrame(actarr, columns=actcols)
# stick the dataframe with the one hot array onto the main dataframe
df = pd.concat([df.drop(columns='activities'), actdf], axis=1)
# fancy print
with pd.option_context("display.max_columns", 20, 'display.width', 9999):
    print(df)
Output:
    datetime    mood               notes  friends  gaming  good food  party  travel   work
0  8/27/2017  "good"                 NaN     True    True      False   True   False  False
1  8/28/2017   "meh"       "Stuff stuff"     True   False       True  False   False   True
2  8/29/2017   "bad"  "Fell off my bike"    False   False      False  False    True   True
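The key step above is the membership test against the full set of activity names: any activity missing from a row simply comes out as False, which is the fill behavior the question asks for. A minimal sketch of that step on a single row follows; the names are copied from the sample data, and `np.isin` is used here (`np.in1d` is its older spelling).

```python
import numpy as np

# Sorted unique activity names, as np.unique would return them
actcols = np.unique(
    ["friends", "party", "gaming", "work", "good food", "travel"]
)

# One day's activities; everything not listed should become False
day = ["work", "travel"]

# Element-wise test: is each column name present in this day's list?
row = np.isin(actcols, day)

for name, flag in zip(actcols, row):
    print(name, flag)
```

Because every row is tested against the same `actcols`, all rows end up with the same width regardless of how many activities each day has.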
Answer 1 (score: 2)
You can simply use get_dummies.
Let's assume this dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'datetime': pd.date_range('2017-08-27', '2017-08-29'),
    'mood': ['good', 'meh', 'bad'],
    'activities': [['friends', 'party', 'gaming'],
                   ['work', 'friends', 'good food'],
                   ['work', 'travel']],
    'notes': [np.nan, 'stuff stuff', 'fell off my bike'],
})
df.set_index(['datetime'], inplace=True)
            mood                  activities             notes
datetime
2017-08-27  good    [friends, party, gaming]               NaN
2017-08-28   meh  [work, friends, good food]       stuff stuff
2017-08-29   bad              [work, travel]  fell off my bike
Then just concat and get_dummies:
df2 = pd.concat([df[['mood','notes']], pd.get_dummies(df['activities'].apply(pd.Series),
prefix='activity')], axis=1)
            mood             notes  activity_friends  activity_work  activity_friends  activity_party  activity_travel  activity_gaming  activity_good food
datetime
2017-08-27  good               NaN                 1              0                 0               1                0                1                   0
2017-08-28   meh       stuff stuff                 0              1                 1               0                0                0                   1
2017-08-29   bad  fell off my bike                 0              1                 0               0                1                0                   0
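Note that the output above contains two `activity_friends` columns, because `apply(pd.Series)` spreads each list across positional columns and `get_dummies` encodes each position separately. One possible way to get a single column per activity, sketched here under the assumption that pandas 0.25 or later is available (for `Series.explode`), is to explode the lists and collapse the indicators back to one row per date:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "mood": ["good", "meh", "bad"],
        "activities": [
            ["friends", "party", "gaming"],
            ["work", "friends", "good food"],
            ["work", "travel"],
        ],
    },
    index=pd.date_range("2017-08-27", "2017-08-29", name="datetime"),
)

# One row per (date, activity) pair; the original index value is repeated
exploded = df["activities"].explode()

# Encode each activity, then collapse back to one row per date.
# max() over 0/1 indicators acts as a logical OR within each date.
dummies = (
    pd.get_dummies(exploded, prefix="activity", dtype=int)
    .groupby(level=0)
    .max()
    .astype(bool)
)

df2 = pd.concat([df[["mood"]], dummies], axis=1)
print(df2)
```

This variant is not part of the answer as posted; it only illustrates one way to avoid the duplicated column names.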
If you want booleans instead, use loc to convert them:
df2.loc[:,df2.columns[2:]] = df2.loc[:,df2.columns[2:]].astype(bool)
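As an alternative sketch, `DataFrame.astype` also accepts a mapping of column names to dtypes and returns a new frame, which avoids positional slicing. The small `dummies` frame below is a hypothetical stand-in with unique column names, not the exact `df2` above.

```python
import pandas as pd

# Hypothetical stand-in: one label column plus 0/1 indicator columns
dummies = pd.DataFrame(
    {
        "mood": ["good", "meh", "bad"],
        "activity_friends": [1, 1, 0],
        "activity_work": [0, 1, 1],
    }
)

# Pick the indicator columns by name rather than by position
indicator_cols = [c for c in dummies.columns if c.startswith("activity_")]

# astype with a dict converts only the listed columns and leaves the rest alone
converted = dummies.astype({c: bool for c in indicator_cols})
print(converted.dtypes)
```

Selecting by prefix keeps the conversion correct even if non-indicator columns are added or reordered later.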