Calculating document weights using machine learning

Time: 2017-08-23 07:15:18

Tags: python machine-learning scikit-learn

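Suppose I have n documents (resumes) in a list, and I want to weigh each document (resume) in the same category against Job description.txt as a reference. I plan to weigh the documents as described below. My question is: are there other ways to weigh the documents in this scenario? Thanks in advance.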

Plan of action:

a) Get the resumes (e.g., 10) that belong to the same category (e.g., java)

b) Get a bag of words from all the documents

Then:

c) For each document, get the feature names using the TfidfVectorizer scores

d) Now I have the feature words in a list

e) Compare these features against the bag of words of the job description

f) Compute the score for each document by summing the columns, and use it to weigh the document
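
Steps (c) through (f) could be sketched roughly as follows. This is a plain-Python illustration, not scikit-learn's TfidfVectorizer: the whitespace tokenizer, the smoothed IDF formula, the sample texts, and the names `tokenize`, `tfidf_weights` and `overlap_score` are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def tfidf_weights(docs):
    """Return one {term: tf-idf weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            # Smoothed IDF, so every term keeps a positive weight.
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return weights


def overlap_score(doc_weights, job_terms):
    # Step (f): add up the weights of terms that also occur in the job description.
    return sum(w for term, w in doc_weights.items() if term in job_terms)


# Made-up sample data, for illustration only.
resumes = [
    "java developer with spring experience",
    "sales professional in the fashion industry",
]
job_description = "senior java developer"

job_terms = set(tokenize(job_description))
scores = [overlap_score(w, job_terms) for w in tfidf_weights(resumes)]
```

Here the first resume gets the higher score because it shares "java" and "developer" with the job description, while the second shares nothing. A raw sum like this tends to favour longer documents, since they have more chances to contain matching terms.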

1 answer:

Answer 0 (score: 0)

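As I understand the question, you want to score the resumes (documents) by how similar they are to the job description document. One approach is to convert all the documents, including the job description, into a TF-IDF matrix. Each document can then be seen as a vector in word space. Once the TF-IDF matrix is built, you can compute the similarity between any two documents using cosine similarity.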
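
As a rough illustration of the idea, the sketch below builds TF-IDF vectors and compares them with cosine similarity in plain Python. The IDF used is the smoothed variant ln((1 + n) / (1 + df)) + 1; the whitespace tokenizer, the toy texts, and the helper names `tfidf_vectors` and `cosine` are assumptions for the example. In practice TfidfVectorizer and cosine_similarity from scikit-learn would do this work.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each document to a sparse {term: weight} TF-IDF vector."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    return [
        {term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
         for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|); 0.0 if either vector is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy data (made up): two resumes followed by the job description.
texts = [
    "administrative assistant supporting senior executives",
    "software developer with strong analytical skills",
    "executive administrative assistant",
]
*resume_vecs, job_vec = tfidf_vectors(texts)
similarities = [cosine(job_vec, vec) for vec in resume_vecs]
```

The first resume shares "administrative" and "assistant" with the job description and gets a positive score; the second shares no terms and scores 0. Note that "executive" and "executives" count as different terms here, which is the kind of mismatch lemmatization is meant to reduce.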

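There are a few additional things to take care of, such as removing stop words, lemmatizing, and handling encoding. You may also want to use n-grams.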
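
A minimal sketch of two of these steps, stop-word removal and n-gram generation, in plain Python. The tiny stop-word set and the function names are assumptions for illustration; real stop-word lists (for example those shipped with NLTK or spaCy) are much longer, and TfidfVectorizer offers comparable behaviour through its `stop_words` and `ngram_range` parameters.

```python
# A deliberately tiny stop-word set (assumption); real lists are far longer.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to", "with"}


def remove_stop_words(tokens):
    # Drop tokens that carry little meaning on their own.
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n):
    # Slide a window of size n over the token list and join each window.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "experience in project management and systems administration".split()
content_tokens = remove_stop_words(tokens)
bigrams = ngrams(content_tokens, 2)
```

With bigrams, a phrase such as "project management" becomes a feature of its own instead of two unrelated words.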

You can also refer to this book for more information.

Edit

Adding some setup code

import string

import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# 'en' was a shortcut name in older spaCy releases; spaCy 3 needs the full model name,
# installed with: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

# translation table that strips punctuation characters
translator = str.maketrans('', '', string.punctuation)

# some sample documents
resumes = ["Executive Administrative Assistant with over 10 years of experience providing thorough and skillful support to senior executives.",
"Experienced Administrative Assistant, successful in project management and systems administration.",
"10 years of administrative experience in educational settings; particular skill in establishing rapport with people from diverse backgrounds.",
"Ten years as an administrative support professional in a corporation that provides confidential case work.",
"A highly organized and detail-oriented Executive Assistant with over 15 years' experience providing thorough and skillful administrative support to senior executives.",
"More than 20 years as a knowledgeable and effective psychologist working with individuals, groups, and facilities, with particular emphasis on geriatrics and the multiple psychopathologies within that population.",
"Ten years as a sales professional with management experience in the fashion industry.",
"More than 6 years as a librarian, with 15 years' experience as an active participant in school-related events and support organizations.",
"Energetic sales professional with a knack for matching customers with optimal products and services to meet their specific needs. Consistently received excellent feedback from customers.",
"More than six years of senior software engineering experience, with strong analytical skills and a broad range of computer expertise.",
"Software Developer/Programmer with history of productivity and successful project outcomes."]

job_doc = ["""Executive Administrative with a knack for matching and effective psychologist with particular emphasis on geriatrics"""]

# combine the two
_all = resumes+job_doc

# convert each text to a spaCy Doc
docs = [nlp(document) for document in _all]

# lemmatize words, remove stop words, strip punctuation
docs_pp = [' '.join(token.lemma_.translate(translator) for token in doc if not token.is_stop) for doc in docs]

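# get the TF-IDF matrix (kept sparse; newer scikit-learn releases no longer accept
# the np.matrix that .todense() returns)
tfidf_vec = TfidfVectorizer()
tfidf_matrix = tfidf_vec.fit_transform(docs_pp)

# similarity of the job description (last row) to every resume
similarities = cosine_similarity(tfidf_matrix[-1:], tfidf_matrix[:-1])
print(similarities)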
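
cosine_similarity returns a 2-D array with one row per query document, so the call above yields a single row holding one score per resume. One possible way to turn such a row into a ranking is sketched below; the scores are invented placeholder values rather than output of the code above, and `rank_documents` is a hypothetical helper.

```python
def rank_documents(scores):
    """Return (index, score) pairs sorted from most to least similar."""
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


# Placeholder scores (made up), standing in for one row of a similarity matrix.
example_scores = [0.12, 0.0, 0.35, 0.07]
ranking = rank_documents(example_scores)
best_index, best_score = ranking[0]
```

The index in each pair points back into the original resume list, so the top-ranked resume text can be looked up with `resumes[best_index]`.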