Python: tf-idf-cosine: how to compute document similarity from a CSV file

Date: 2016-04-20 16:13:13

Tags: python csv tf-idf cosine

I have a book.csv file that contains a list of bibliographic records. I also have a user table in my database that contains users' information needs. My goal is to compute TF-IDF and the cosine similarity between a user's information need (taken from the database table as the query) and the rows of book.csv (as the documents), and, when a User_Id is entered, print the rows most similar to that user's information need. I ran into problems treating the CSV rows as documents; any help with this error would be appreciated: IndexError: list index out of range. Another problem is that even when I enter a correct User_Id it keeps replying with the error message until the loop reaches that user's row, i.e. if the user is the 3rd row in the database table I have to try three times, like this:

insert User_Id
JU/MF3024/04
no such User exist
insert User_Id
JU/MF3024/04
insert User_Id
JU/MF3024/04
no such User exist
Fit Vectorizer to train set [[0 1]
[1 0]]
Transform Vectorizer to test set [[0 0]
[0 0]

Here is my implementation in Python 2.7.11. I used some of the code from Python: tf-idf-cosine: to find document similarity.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import linear_kernel
import numpy as np
import numpy.linalg as LA
import pandas as pd
from nltk.corpus import stopwords
from collections import defaultdict
import csv
import mysql.connector as sql
from mysql.connector import connection
with open("Book.csv", "rb") as books:
     reader = csv.reader(books, delimiter=',')
     reader.next()
     count = 0
     docs = {}
     for row in reader:
         docs = row[1].split()  # I want to treat each row as a document, like the train set in the linked post
query = "" # like test_set on the above post
config = {'user': 'root', 'password': '929255@Tenth', 'host': '127.0.0.1','database': 'juls', 'raise_on_warnings': True,}
db = sql.connect(**config)
cursor = db.cursor()
query = "SELECT * FROM user"
cursor.execute(query)
result = cursor.fetchall()
for r in result:
    User_Id = r[0]
    First_Name = r[1]
    Last_Name = r[2]
    College = r[3]
    Department = r[4]
    Info_need = r[5]
    email = r[6]
    print "insert User_Id"
    Id = str(raw_input())
    if Id not in User_Id:
        print "no such User exist"
        pass
    elif Id =="":
        print "User ID is blank"
        pass
    else:
        query = "SELECT Info_need from user WHERE User_Id = '%s'" % Id
        cursor.execute(query)
stopWords = set(stopwords.words('english'))
vectorizer = CountVectorizer(stop_words = stopWords)
transformer = TfidfTransformer()
trainVectorizerArray = vectorizer.fit_transform(docs).toarray()
testVectorizerArray = vectorizer.transform(query).toarray()
print 'Fit Vectorizer to train set', trainVectorizerArray
print 'Transform Vectorizer to test set', testVectorizerArray
cx = lambda a, b: round(np.inner(a, b)/(LA.norm(a)*LA.norm(b)), 3)
for vector in trainVectorizerArray:
    for testV in testVectorizerArray:
        cosine = cx(vector, testV)
transformer.fit(trainVectorizerArray)
transformer.fit(testVectorizerArray)
tfidf = transformer.transform(testVectorizerArray)
print tfidf.todense()
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix_train = tfidf_vectorizer.fit_transform(docs)
print "RANKED TF-IDF"
print tfidf[0:1]
cosine_similarities = linear_kernel(tfidf[0:1], tfidf).flatten()
print cosine_similarities
related_docs_indices = cosine_similarities.argsort()[:-5:-1]
print related_docs_indices
print cosine_similarities[related_docs_indices]
print docs[14]

2 Answers:

Answer 0 (score: 0)

I have solved the problem of treating the CSV rows as documents; the code below is one of my solutions. Please help with the other problems in the post.

stopWords = set(stopwords.words('english'))
lines = []
with open('Booklist.csv', 'rb') as f:
    reader = csv.reader(f)
    for row in reader:
        if reader.line_num == 1:
            continue  # skip the header line
        your_list = row[1]
        lines.append(your_list)

def build_lexicon(corpus):
    lexicon = set()
    for doc in corpus:
        lexicon.update([word for word in doc.split() if word not in stopWords])  # remove stop words
    return lexicon

vocabulary = build_lexicon(lines)
print 'My vocabulary vector is [' + ','.join(list(vocabulary)) + ']'  # prints the whole vocabulary of column row[1] without stop words
for doc in lines:
    print 'The doc %d is: %s' % ((lines.index(doc) + 1), doc)  # prints each line as a document, which is my intention
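
As a rough sketch of one possible next step (not necessarily the final solution), the lines collected above can also be handed directly to scikit-learn's TfidfVectorizer, which does the stop-word removal and vocabulary building itself:

from sklearn.feature_extraction.text import TfidfVectorizer

# build the document-term matrix directly from the CSV rows collected in `lines`
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
doc_matrix = tfidf_vectorizer.fit_transform(lines)   # one row per book record

print 'vocabulary size:', len(tfidf_vectorizer.get_feature_names())
print 'document-term matrix shape:', doc_matrix.shape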

Answer 1 (score: 0)

This answer is for the database part of the question: extracting the user's information need.

db = sql.connect(**config)
cursor = db.cursor()
Id = str(raw_input("insert User_Id: "))
queries = []  # collect the extracted information need(s) as a list
cursor.execute("SELECT MajorSubjectInterest, SubsidiarySubjectInterest FROM user WHERE User_Id = '%s'" % (Id))
result = cursor.fetchall()
for r in result:
    Major = r[0]
    Subsidiary = r[1]
    if Subsidiary == "":  # the SubsidiarySubjectInterest field allows NULL input
        need = Major
    else:
        need = Major + '; ' + Subsidiary
    # after extracting the user's information need I want it as a list, so I append it to the empty list;
    # something like needs = need.split(';', 5) would also work
    queries.append(need)
db.close()
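
A minimal sketch of the last step, tying the two answers together: the information need(s) collected in `queries` can be transformed with the same vectorizer that was fitted on `lines` in answer 0, and the rows ranked by cosine similarity. Here `tfidf_vectorizer` and `doc_matrix` are assumed to come from the sketch after answer 0.

from sklearn.metrics.pairwise import cosine_similarity

for need in queries:
    # transform the query with the vocabulary learned from the documents
    query_vector = tfidf_vectorizer.transform([need])   # note: pass a list, not a bare string
    similarities = cosine_similarity(query_vector, doc_matrix).flatten()

    # indices of the 5 most similar book records, best match first
    top_indices = similarities.argsort()[::-1][:5]
    for i in top_indices:
        print 'score %.3f  doc: %s' % (similarities[i], lines[i])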