I have the following DataFrame, in which one column holds multiple values per row:
my column
0 - ["A", "B"]
1 - ["B", "C", "D"]
2 - ["B", "D"]
How can I get a DataFrame like this, where each column is named after one of the values in "my column"?
"A" "B" "C" "D"
0 - 1 1 0 0
1 - 0 1 1 1
2 - 0 1 0 1
Answer 0 (score: 2)
If the column contains lists, use Series.str.join together with Series.str.get_dummies:
df = df['my column'].str.join('|').str.get_dummies()
print (df)
A B C D
0 1 1 0 0
1 0 1 1 1
2 0 1 0 1
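A self-contained version of the list case, as a minimal sketch. The sample frame is rebuilt from the question, and the result is stored under a new name (`dummies`, chosen for this example) so the original column stays available. Since str.get_dummies splits on '|' by default, this assumes no value itself contains '|'.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({'my column': [['A', 'B'], ['B', 'C', 'D'], ['B', 'D']]})

# Join each list into one '|'-separated string, then split that string
# back into one indicator column per distinct value.
dummies = df['my column'].str.join('|').str.get_dummies()
print(dummies)
```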
Alternatively, use MultiLabelBinarizer from scikit-learn:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(df['my column']), columns=mlb.classes_)
print (df)
A B C D
0 1 1 0 0
1 0 1 1 1
2 0 1 0 1
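The MultiLabelBinarizer route as a runnable sketch. Passing `index=df.index` is an addition for this example, not part of the answer above: it keeps the original row labels when the frame does not use the default 0..n-1 index. The labels 'x', 'y', 'z' are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Sample data from the question, with a hypothetical non-default index.
df = pd.DataFrame({'my column': [['A', 'B'], ['B', 'C', 'D'], ['B', 'D']]},
                  index=['x', 'y', 'z'])

mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(mlb.fit_transform(df['my column']),
                       columns=mlb.classes_,  # sorted distinct labels
                       index=df.index)        # keep the original row labels
print(encoded)
```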
If the column contains strings, use Series.str.strip together with Series.str.get_dummies, and finally strip the " characters from the column names if necessary:
df = (df['my column'].str.strip('[]')
                     .str.get_dummies(', ')
                     .rename(columns=lambda x: x.strip('"')))
print (df)
A B C D
0 1 1 0 0
1 0 1 1 1
2 0 1 0 1
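If the strings are valid Python list literals such as '["A", "B"]', one alternative (not taken from the answer above) is to parse them back into lists with `ast.literal_eval` from the standard library and then reuse the list approach. A sketch, assuming every cell parses cleanly and no value contains '|':

```python
import ast
import pandas as pd

# Each cell is a string that looks like a list, not an actual list.
df = pd.DataFrame({'my column': ['["A", "B"]', '["B", "C", "D"]', '["B", "D"]']})

# Turn every string into a real Python list.
parsed = df['my column'].apply(ast.literal_eval)

# From here on the list-based approach applies unchanged.
from_strings = parsed.str.join('|').str.get_dummies()
print(from_strings)
```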
Answer 1 (score: 1)
Just for fun, here is a naive implementation of the dummy encoding:
import pandas as pd

my_column = pd.Series([['A', 'B'], ['B', 'C', 'D'], ['B', 'D']])
frame = pd.DataFrame({'my_column': my_column})

# Collect every distinct value across all rows, sorted, to use as headers:
headers = sorted({x for y in frame['my_column'] for x in y})

# Make a list of the DataFrame rows (each row is stored as a list):
rows = frame['my_column'].tolist()

builder = {}  # dictionary used to build the new DataFrame
for header in headers:
    column = []
    for row in rows:
        # 1 if the value occurs in this row, otherwise 0
        if header in row:
            column.append(1)
        else:
            column.append(0)
    builder[header] = column

frameB = pd.DataFrame(builder)
print(frameB)
Result:
A B C D
0 1 1 0 0
1 0 1 1 1
2 0 1 0 1
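The same naive idea can be written without pandas at all. The sketch below uses a hypothetical helper name, `one_hot_rows`, and returns the sorted headers together with a list of 0/1 rows.

```python
def one_hot_rows(rows):
    """Return (headers, matrix); matrix[i][j] is 1 if headers[j] is in rows[i]."""
    # Distinct values across all rows, sorted so the column order is stable.
    headers = sorted({value for row in rows for value in row})
    # One 0/1 row per input row, one entry per header.
    matrix = [[1 if header in row else 0 for header in headers] for row in rows]
    return headers, matrix


headers, matrix = one_hot_rows([['A', 'B'], ['B', 'C', 'D'], ['B', 'D']])
print(headers)
for row in matrix:
    print(row)
```

If a DataFrame is needed afterwards, `pd.DataFrame(matrix, columns=headers)` builds one from these two values.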
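Answer 2 (score: 0)
I think what you are looking for is the get_dummies() function in pandas, which is described in the pandas documentation.
From the documentation: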
s = pd.Series(list('abca'))
pd.get_dummies(s)
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
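Note that `pd.get_dummies` encodes scalar values, so the list column from the question needs an extra step first. One possible sketch: flatten the lists with `Series.explode`, encode the flattened values, then collapse back to one row per original index with `groupby(level=0).max()`. The names `exploded` and `result` are chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({'my column': [['A', 'B'], ['B', 'C', 'D'], ['B', 'D']]})

# One row per list element; the original index label is repeated.
exploded = df['my column'].explode()

# Encode each element, then merge the rows that share an index label.
result = pd.get_dummies(exploded).groupby(level=0).max().astype(int)
print(result)
```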
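Answer 3 (score: 0)
You can use CountVectorizer, which is designed for exactly this kind of task. It takes a text corpus and builds a token-count matrix, which amounts to one-hot encoding when no value repeats within a row.
Note: I am using "Cat", "Dog", "Cow", "Tiger" instead of "A", "B", "C", "D" (with its default settings, CountVectorizer drops single-character tokens such as "A").
Code:
Imports: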
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
Helper that converts a list of elements into a single string:
def get_string(listt):
    return ' '.join(listt)
Create a DataFrame from the lists:
my_column = pd.Series([['Cat','Dog'],['Dog','Cow','Tiger'],['Dog','Tiger']])
df = pd.DataFrame(my_column, columns=['my_column'])
print(df)
df['text_data'] = df.my_column.apply(get_string)
print(df)
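Perform the text vectorization:
tf_vectorizer = CountVectorizer(stop_words=None)
vectorized_data = tf_vectorizer.fit_transform(df.text_data)
Prepare the final DataFrame:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
final_df = pd.DataFrame(vectorized_data.toarray(), columns=tf_vectorizer.get_feature_names_out())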
print(final_df)
Output:
Our DataFrame:
my_column
0 [Cat, Dog]
1 [Dog, Cow, Tiger]
2 [Dog, Tiger]
DataFrame with the text column:
my_column text_data
0 [Cat, Dog] Cat Dog
1 [Dog, Cow, Tiger] Dog Cow Tiger
2 [Dog, Tiger] Dog Tiger
Expected result:
cat cow dog tiger
0 1 0 1 0
1 0 1 1 1
2 0 0 1 1
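Two caveats, with a sketch of one way around them: by default CountVectorizer lowercases tokens and drops single-character ones, so the question's "A", "B", "C", "D" would not survive the default settings. Passing a callable as `analyzer` makes it treat each list as already tokenized, and `binary=True` caps every count at 1. The names `vectorizer`, `matrix`, `feature_names` and `encoded` are chosen for this example and are not part of the answer above.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

my_column = pd.Series([['A', 'B'], ['B', 'C', 'D'], ['B', 'D']])

# The identity analyzer returns each list unchanged, so every list item
# becomes one token: no joining, lowercasing, or regex tokenizing happens.
vectorizer = CountVectorizer(analyzer=lambda tokens: tokens, binary=True)
matrix = vectorizer.fit_transform(my_column)

# vocabulary_ maps each token to its column position in the matrix.
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
encoded = pd.DataFrame(matrix.toarray(), columns=feature_names)
print(encoded)
```

Reading the column names from `vocabulary_` avoids depending on which feature-name method a given scikit-learn version provides.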