I have the following dataset:
    import pandas as pd

    data = pd.DataFrame({
        'Members': ['Biology PhD student', 'Chemistry Master student',
                    'Engineering undergraduate student', 'Administration staff',
                    'Reception staff', 'Research Associate Chemistry',
                    'Associate Statistics'],
        'UCode': [1, 1, 1, 2, 2, 1, 1],
        'id': ['aaa100', 'aaa121', 'aa123', 'bb212', 'bb214', 'aa111', 'aa109'],
    })
    data

                                 Members  UCode      id
    0                Biology PhD student      1  aaa100
    1           Chemistry Master student      1  aaa121
    2  Engineering undergraduate student      1   aa123
    3               Administration staff      2   bb212
    4                    Reception staff      2   bb214
    5       Research Associate Chemistry      1   aa111
    6               Associate Statistics      1   aa109

where the column Members contains a string describing the function of each listed member.
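Which kind of text analysis would you suggest to find groups of similar members using only the information (text) in the data.Members column? In this toy example, for instance, the analysis should return two distinct groups. I was thinking of computing the similarity between two lists of strings/words.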
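As a rough sketch of that word-list similarity idea, the Jaccard index (shared words divided by all distinct words) could compare two member descriptions. The helper name `jaccard` and the lowercasing step are assumptions made here for illustration only.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard index of the lowercase word sets of two strings."""
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0  # two empty strings: define the similarity as 0
    return len(words_a & words_b) / len(union)


members = [
    'Biology PhD student', 'Chemistry Master student',
    'Engineering undergraduate student', 'Administration staff',
    'Reception staff', 'Research Associate Chemistry',
    'Associate Statistics',
]

# Two students share one word out of five distinct words.
print(jaccard(members[0], members[1]))
# Two staff members share one word out of three distinct words.
print(jaccard(members[3], members[4]))
# A student and a staff member share no words.
print(jaccard(members[0], members[3]))
```

Pairs with a nonzero score share at least one word, which is the signal any grouping built on this measure would rely on.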
Any suggestions/help would be much appreciated.
Thanks,
Marco
Answer 0: (score: 1)
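A simple word counter, for example:

    from collections import Counter

    members = data['Members']  # assumed: the text column from the question

    WordCounter = Counter()
    for text in members:
        # split each description into words and tally them
        words = text.split(' ')
        for word in words:
            WordCounter[word] += 1

    print(WordCounter.most_common(3))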
Output (on the Members column above, Python 3.7+): [('student', 3), ('Chemistry', 2), ('staff', 2)]
Answer 1: (score: 0)
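You need to convert the strings in 'Members' into word vectors, then either cluster these vectors, if you do not know the number of groups a priori, or run a classification task, if you do know the number of classes/groups.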
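The vectorize-then-cluster route can be sketched with the standard library alone. This is only one possible reading of the advice: it uses bag-of-words count vectors, cosine similarity, and a simple single-linkage grouping (members are merged whenever their similarity exceeds a threshold), so the number of groups does not have to be known in advance. The names `cosine`, `cluster`, and `threshold` are assumptions for illustration; a library such as scikit-learn would be the usual choice on real data.

```python
from collections import Counter
from math import sqrt

members = [
    'Biology PhD student', 'Chemistry Master student',
    'Engineering undergraduate student', 'Administration staff',
    'Reception staff', 'Research Associate Chemistry',
    'Associate Statistics',
]

# Bag-of-words "vectors": one word -> count mapping per member.
vectors = [Counter(text.lower().split()) for text in members]


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def cluster(vectors, threshold=0.0):
    """Single-linkage grouping: merge members whose similarity exceeds threshold."""
    parent = list(range(len(vectors)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) > threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


groups = cluster(vectors)
print(groups)  # → [[0, 1, 2, 5, 6], [3, 4]]
```

With the threshold at 0, any shared word links two members, so on the toy data the row indices fall into the same two groups as `UCode`. On larger, noisier text a higher threshold or TF-IDF weighting would likely be needed to keep common words from chaining unrelated members together.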
Answer 2: (score: 0)