Sklearn cosine_similarity在python中将1D数组转换为2D数组

时间:2019-03-07 03:21:58

标签: python scikit-learn nltk cosine-similarity

我正在学习自然语言处理,并在Python中使用nltk模块和scikit学习模块。在编写自己的代码之前,我想先看看现有代码的工作方式。因此,我在网上寻找了基于这些库构建的聊天机器人,然后在github上找到了一个聊天机器人。 我下载了一个聊天机器人的github代码,该聊天机器人使用了Scikit学习和nltk模块。这是该代码

from __future__ import division
import numpy as np
import pandas as pd
import sys
import nltk
import pyprind
from nltk.corpus import wordnet as wn
from sklearn.externals import joblib
from sklearn.metrics.pairwise import cosine_similarity 

SYNSETS = joblib.load('blobs/SYNSETS.pkl')
TAGS_HASH = joblib.load('blobs/TAGS_HASH.pkl')
data = pd.read_csv('data/friends-final.txt', sep='\t')
triturns = joblib.load('blobs/triturns.pkl')
filtered_triturns = joblib.load('blobs/filtered.pkl')
all_tags = ['CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD',
       'NN', 'NNS', 'NNP', 'NNPS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB',
       'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN',
       'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB']

def get_synsets(text):
    if not text in SYNSETS:
        sent = nltk.pos_tag(nltk.word_tokenize(text))
        chunks = nltk.ne_chunk(sent, binary=False)
        s = set()
        def add_synsets(synsets):
            for synset in synsets:
                s.add(synset)
        for c in chunks:
            if hasattr(c, 'node'):
                if c.node == 'PERSON':
                    add_synsets(wn.synsets('person', pos=wn.NOUN))
                elif c.node == 'ORGANIZATION':
                    add_synsets(wn.synsets('organization', pos=wn.NOUN))                
                elif c.node == 'GPE':
                    add_synsets(wn.synsets('place', pos=wn.NOUN))
                elif c.node == 'LOCATION':
                    add_synsets(wn.synsets('location', pos=wn.NOUN))
                elif c.node == 'FACILITY':
                    add_synsets(wn.synsets('facility', pos=wn.NOUN))
                elif c.node == 'GSP':
                    add_synsets(wn.synsets('group', pos=wn.NOUN))                
                else:
                    print c, c.node, c.leaves()
            elif c[1][:2] in ['VB', 'JJ', 'ADV', 'NN']:
                pos = {'VB': wn.VERB, 'NN': wn.NOUN, 'ADV': wn.ADV, 'JJ': wn.ADJ}[c[1][:2]]
                add_synsets(wn.synsets(c[0], pos=pos))
            else:
                add_synsets(wn.synsets(c[0]))
        SYNSETS[text] = set([x.name for x in s])
    return SYNSETS[text]

def sem_sim(s1, s2):
    ss1 = get_synsets(s1)
    ss2 = get_synsets(s2)
    if ss1 == ss2:
        return 1
    return 2*len(ss1.intersection(ss2)) / (len(ss1) + len(ss2))

def cos_sim(s1, s2):
    d = [{}, {}]
    for p in all_tags:
        d[0][p] = d[1][p] = 0
    for i,s in enumerate([s1, s2]):
        if not s in TAGS_HASH:
            TAGS_HASH[s] = nltk.pos_tag(nltk.word_tokenize(s))
        tags = TAGS_HASH[s]
        for t in tags:
            if t[1] in d[i]:
                d[i][t[1]] += 1
    return cosine_similarity([d[0][p] for p in all_tags], [d[1][p] for p in all_tags])[0][0] #ERROR OCCURS HERE

def sim(s1, s2, alpha=0.7):
    return alpha*sem_sim(s1, s2) + (1-alpha)*cos_sim(s1, s2) #ERROR OCCURS HERE

def get_response(msg):
    best_val = 0
    best = None
    bar = pyprind.ProgBar(len(filtered_triturns), monitor=True)
    for t in filtered_triturns:
        question = t[0]
        answer = t[1]
        val = sim(msg, question)
        if (val > best_val):# or (val == best_val and len(answer) < len(msg))):
            best = answer
            best_val = val
        bar.update()
    return best

def main():
    while True:
        msg = raw_input('--> ')
        print get_response(msg)
        sys.stdout.flush()

def filter_triturns(thresh=0.7):
    L = []
    bar = pyprind.ProgBar(len(triturns), monitor=True)
    for i,tt in enumerate(triturns):
        a = data.irow(tt)['line']
        b = data.irow(tt+1)['line']
        c = data.irow(tt+2)['line']
        if sem_sim(a, b) > thresh:
            L.append([a,b])
        if sem_sim(b, c) > thresh:
            L.append([c,b])
        bar.update()
    return L

if __name__ == '__main__':
    main()

但是运行此代码后,我会收到错误消息:

ValueError: Expected 2D array, got 1D array instead:
array=[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

错误发生在行return cosine_similarity([d[0][p] for p in all_tags], [d[1][p] for p in all_tags])[0][0]及其下面的行(在上面的代码中用#标记了所述行)

基本上是在问我要提供2D数组,而不是我尝试做的1D数组,但这似乎行不通。我是sklearn的新手,所以我不完全知道该在什么时候将代码放在方括号内。

将[abc]这样的1D数组转换为2D数组会遵循[[abc]]的过程,但是在这种情况下我该在哪里做呢?

我正在使用python 2.7

2 个答案:

答案 0 :(得分:2)

您必须通过cosine_similarity计算来更改它,只需更改

cosine_similarity([d[0][p] for p in all_tags], [d[1][p] for p in all_tags])[0][0]

cosine_similarity([[d[0][p] for p in all_tags]], [[d[1][p] for p in all_tags]])[0][0]

答案 1 :(得分:-1)

另一种可能性是使用.reshape()函数