I'm trying to one-hot encode a column whose values are lists of tickers, like this:
tickers
1 [DIS]
2 [AAPL,AMZN,BABA,BAY]
3 [MCDO,PEP]
4 [ABT,ADBE,AMGN,CVS]
5 [ABT,CVS,DIS,ECL,EMR,FAST,GE,GOOGL]
...
First, I collect the set of all distinct tickers (about 467 of them):
    all_tickers = list()
    for tickers in df.tickers:
        for ticker in tickers:
            all_tickers.append(ticker)
    all_tickers = set(all_tickers)
Then I implemented the one-hot encoding this way:
    for i in range(len(df.index)):
        for ticker in all_tickers:
            # The index starts at 1, so positional row i has label i + 1.
            if ticker in df.iloc[i]['tickers']:
                df.at[i+1, ticker] = 1
            else:
                df.at[i+1, ticker] = 0
The problem is that the script runs extremely slowly once there are around 5000+ rows. How can I improve the algorithm?
Answer 0 (score: 4)
I think you need str.join followed by str.get_dummies:
df = df['tickers'].str.join('|').str.get_dummies()
Or:
    from sklearn.preprocessing import MultiLabelBinarizer
    mlb = MultiLabelBinarizer()
    df = pd.DataFrame(mlb.fit_transform(df['tickers']), columns=mlb.classes_, index=df.index)
    print(df)
       AAPL  ABT  ADBE  AMGN  AMZN  BABA  BAY  CVS  DIS  ECL  EMR  FAST  GE  \
    1     0    0     0     0     0     0    0    0    1    0    0     0   0
    2     1    0     0     0     1     1    1    0    0    0    0     0   0
    3     0    0     0     0     0     0    0    0    0    0    0     0   0
    4     0    1     1     1     0     0    0    1    0    0    0     0   0
    5     0    1     0     0     0     0    0    1    1    1    1     1   1

       GOOGL  MCDO  PEP
    1      0     0    0
    2      0     0    0
    3      0     1    1
    4      0     0    0
    5      1     0    0
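For anyone who wants to try the first variant end to end, here is a minimal sketch. The sample frame is rebuilt from the rows shown in the question, and the name `encoded` is only illustrative. It relies on `str.get_dummies` splitting on "|" by default, so it assumes no ticker contains that character.

```python
import pandas as pd

# Sample data mirroring the question; the index starts at 1 as in the post.
df = pd.DataFrame(
    {"tickers": [["DIS"],
                 ["AAPL", "AMZN", "BABA", "BAY"],
                 ["MCDO", "PEP"],
                 ["ABT", "ADBE", "AMGN", "CVS"],
                 ["ABT", "CVS", "DIS", "ECL", "EMR", "FAST", "GE", "GOOGL"]]},
    index=range(1, 6),
)

# Join each list into a single "|"-separated string, then let
# str.get_dummies split it back into one indicator column per ticker.
encoded = df["tickers"].str.join("|").str.get_dummies()

print(encoded)
```

Both calls operate on the whole column at once, so there is no Python-level loop over rows and tickers as in the question.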
Answer 1 (score: 1)
You can use apply(pd.Series) and then get_dummies():
    df = pd.DataFrame({"tickers": [["DIS"], ["AAPL","AMZN","BABA","BAY"],
                                   ["MCDO","PEP"], ["ABT","ADBE","AMGN","CVS"],
                                   ["ABT","CVS","DIS","ECL","EMR","FAST","GE","GOOGL"]]})
    pd.get_dummies(df.tickers.apply(pd.Series), prefix="", prefix_sep="")
       AAPL  ABT  DIS  MCDO  ADBE  AMZN  CVS  PEP  AMGN  BABA  DIS  BAY  CVS  ECL  \
    0     0    0    1     0     0     0    0    0     0     0    0    0    0    0
    1     1    0    0     0     0     1    0    0     0     1    0    1    0    0
    2     0    0    0     1     0     0    0    1     0     0    0    0    0    0
    3     0    1    0     0     1     0    0    0     1     0    0    0    1    0
    4     0    1    0     0     0     0    1    0     0     0    1    0    0    1

       EMR  FAST  GE  GOOGL
    0    0     0   0      0
    1    0     0   0      0
    2    0     0   0      0
    3    0     0   0      0
    4    1     1   1      1
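Note that the output above contains DIS and CVS twice: get_dummies encodes each list position separately, so a ticker that occurs at different positions in different rows gets more than one column. The sketch below shows one possible way to merge the same-named columns. It is not part of the original answer, and the names `wide`, `dummies`, and `encoded` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"tickers": [["DIS"], ["AAPL", "AMZN", "BABA", "BAY"],
                               ["MCDO", "PEP"], ["ABT", "ADBE", "AMGN", "CVS"],
                               ["ABT", "CVS", "DIS", "ECL", "EMR", "FAST", "GE", "GOOGL"]]})

# One column per list position; shorter lists are padded with NaN.
wide = df["tickers"].apply(pd.Series)

# Each position is encoded on its own, which is where the
# duplicated column names come from.
dummies = pd.get_dummies(wide, prefix="", prefix_sep="")

# Merge same-named columns: transpose so the names become the index,
# take the maximum per name, then transpose back.
encoded = dummies.astype(int).T.groupby(level=0).max().T

print(encoded)
```

Taking the maximum keeps a 1 wherever any of the duplicated columns had one, so each ticker ends up with a single indicator column.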