How can I add counts of the strings present in the target column?
import pandas as pd

data = [{'target': ['Aging', 'Brain', 'Neurons', 'Genetics']},
        {'target': ['Dementia', 'Genetics']},
        {'target': ['Brain', 'Dementia', 'Genetics']}]
df = pd.DataFrame(data)
DataFrame:
                              target
0  [Aging, Brain, Neurons, Genetics]
1               [Dementia, Genetics]
2        [Brain, Dementia, Genetics]
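Unique labels:
target = []
for sublist in df['target'].values:
    # Strip surrounding whitespace from every tag in this row
    tmp_list = [x.strip() for x in sublist]
    target.extend(tmp_list)
# Deduplicate; a set has no guaranteed order, so the result order may vary
target = list(set(target))
# ['Brain', 'Neurons', 'Aging', 'Genetics', 'Dementia']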
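The loop above can also be written with pandas alone. This is only a minimal sketch, assuming pandas 0.25 or later (where `Series.explode` is available) and that every cell in `target` is a list of strings; the name `unique_tags` is illustrative and not from the question.

```python
import pandas as pd

data = [{'target': ['Aging', 'Brain', 'Neurons', 'Genetics']},
        {'target': ['Dementia', 'Genetics']},
        {'target': ['Brain', 'Dementia', 'Genetics']}]
df = pd.DataFrame(data)

# explode() yields one row per tag; strip whitespace, then keep distinct values.
# sorted() gives a stable order, unlike list(set(...)).
unique_tags = sorted(df['target'].explode().str.strip().unique())
print(unique_tags)
# ['Aging', 'Brain', 'Dementia', 'Genetics', 'Neurons']
```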
Answer 0 (score: 2)
If you need indicator columns (only 0 or 1):
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# fit_transform returns a 0/1 matrix; mlb.classes_ holds the sorted labels
df1 = pd.DataFrame(mlb.fit_transform(df['target']), columns=mlb.classes_)
print(df1)
   Aging  Brain  Dementia  Genetics  Neurons
0      1      1         0         1        1
1      0      0         1         1        0
2      0      1         1         1        0
Or use Series.str.join together with Series.str.get_dummies, although this is slower:
df1 = df['target'].str.join('|').str.get_dummies()
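Since the question asks how to add these columns to the existing frame, here is a minimal sketch of attaching the indicator columns to `df` with `DataFrame.join`. It assumes that no tag contains the `|` character, which `str.get_dummies` uses as its default separator; the names `indicators` and `result` are illustrative.

```python
import pandas as pd

data = [{'target': ['Aging', 'Brain', 'Neurons', 'Genetics']},
        {'target': ['Dementia', 'Genetics']},
        {'target': ['Brain', 'Dementia', 'Genetics']}]
df = pd.DataFrame(data)

# Join each list into 'A|B|C', then split on '|' into 0/1 columns.
indicators = df['target'].str.join('|').str.get_dummies()

# Attach the indicator columns next to the original 'target' column.
result = df.join(indicators)
print(result)
```

Note that this route only marks presence: a tag repeated within one list still yields 1, not its count.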
If you need the count of each value within the lists:
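from collections import Counter

data = [{'target': ['Neurons', 'Brain', 'Neurons', 'Neurons']},
        {'target': ['Dementia', 'Genetics']},
        {'target': ['Brain', 'Brain', 'Genetics']}]
df = pd.DataFrame(data)

# One Counter per row; tags missing from a row become NaN, then 0.
# Column order may vary with the pandas version (newer versions keep
# first-appearance order rather than sorting).
df = pd.DataFrame([Counter(x) for x in df['target']]).fillna(0).astype(int)
print(df)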
   Brain  Dementia  Genetics  Neurons
0      1         0         0        3
1      0         1         1        0
2      2         0         1        0
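The same counts can also be obtained without `Counter`. The following is a sketch of one pandas-only alternative, not the answer's method: it assumes pandas 0.25 or later for `Series.explode`, and the names `exploded`, `counts`, and `result` are illustrative.

```python
import pandas as pd

data = [{'target': ['Neurons', 'Brain', 'Neurons', 'Neurons']},
        {'target': ['Dementia', 'Genetics']},
        {'target': ['Brain', 'Brain', 'Genetics']}]
df = pd.DataFrame(data)

# One row per (original row, tag) pair; reset_index() exposes the
# original row number as a column named 'index'.
exploded = df['target'].explode().reset_index()

# Cross-tabulate row number against tag to count occurrences.
counts = pd.crosstab(exploded['index'], exploded['target'])

# Rows whose list is empty would drop out of the crosstab; restore them as zeros.
counts = counts.reindex(df.index, fill_value=0)

# Keep the original column and add the count columns beside it.
result = df.join(counts)
print(result)
```

Unlike the `Counter` version, this keeps the original `target` column instead of overwriting `df`.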
Answer 1 (score: 0)
Maybe this will help:
# Instead of building the target list,
# join each list of tags into a single space-separated string.
# Note: `data` here evidently holds four records (see the output below).
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

list_to_str = [" ".join(tags['target']) for tags in data]
# ['Aging Brain Neurons Genetics',
#  'Dementia Genetics',
#  'Brain Dementia Genetics',
#  'Neurons Brain Neurons Neurons']

text_data = np.array(list_to_str)
# Create the bag-of-words feature matrix (a sparse matrix)
count = CountVectorizer()
bag_of_words = count.fit_transform(text_data)
# Get feature names; get_feature_names() was removed in scikit-learn 1.2
feature_names = count.get_feature_names_out()
# Convert the sparse matrix to a dense array and build the DataFrame
df1 = pd.DataFrame(bag_of_words.toarray(), columns=feature_names)
print(df1)
## Output
   aging  brain  dementia  genetics  neurons
0      1      1         0         1        1
1      0      0         1         1        0
2      0      1         1         1        0
3      0      1         0         0        3