Grouping sentences into paragraphs

Asked: 2017-01-28 01:13:31

Tags: python sorting nlp

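I want to automatically arrange a set of sentences into paragraphs. At the moment I am reading the sentences from a file and measuring the string distance between them. I think the next logical step is to find a second property for classifying the sentences, plot the sentences on a graph using these two properties, and then apply the KMeans algorithm, which should help me work out which sentences are similar to each other and therefore which ones belong in the same paragraph. This has turned out to be harder than I expected, so I would appreciate any input: a second property I could measure for each tweet, a different approach, or a tool that does this for me. Here is the code I am using so far: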

import re
import math
from collections import Counter
import itertools


# First understand this code so that we can manipulate it.
WORD = re.compile(r'\w+')

def get_cosine(vec1, vec2):
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator

def text_to_vector(text):
    words = WORD.findall(text)
    return Counter(words)

# Count the number of tweets, store it in a variable, and use it as the size of the matrix.
# This is where the text comes from.
with open("positive copy.txt", "r") as pt:
    lines = pt.readlines()
    # Count how many lines we have
    count = len(lines)
    # Create a count * count size matrix
    Matrix = [[1 for x in range(count)] for y in range(count)] 
    # Loop through lines, assigning x as the index of the current line and lineA as its text
    for x, lineA in enumerate(lines):
        vectorA = text_to_vector(lineA)
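        # NOTE: islice(lines, count - x) yields only the first count - x lines,
        # so some (x, y) pairs are never compared and keep their initial value of 1.
        for y, lineB in enumerate(itertools.islice(lines, count - x)):
            vectorB = text_to_vector(lineB)
            cosine = get_cosine(vectorA, vectorB)
            print(lineA, lineB, "\n Cosine:", cosine, "\n")
            # The matrix is symmetric, so store the score in both positions
            Matrix[y][x] = cosine
            Matrix[x][y] = cosine
    print(Matrix)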
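
The inner loop above iterates over `itertools.islice(lines, count - x)`, which covers only the first `count - x` lines, so pairs such as (1, 5) or (2, 4) are never scored. A minimal sketch of one way to fill the whole matrix, assuming every pair of sentences should be compared exactly once; `similarity_matrix` is a hypothetical helper name, not part of the code above:

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")


def text_to_vector(text):
    # Bag-of-words counts, same tokenisation as the code above.
    return Counter(WORD.findall(text))


def get_cosine(vec1, vec2):
    # Cosine similarity between two word-count vectors.
    shared = set(vec1) & set(vec2)
    numerator = sum(vec1[w] * vec2[w] for w in shared)
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    denominator = norm1 * norm2
    return numerator / denominator if denominator else 0.0


def similarity_matrix(sentences):
    # Hypothetical helper: builds the full symmetric matrix.
    vectors = [text_to_vector(s) for s in sentences]
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(x, n):  # upper triangle, diagonal included
            score = get_cosine(vectors[x], vectors[y])
            matrix[x][y] = score
            matrix[y][x] = score  # mirror into the lower triangle
    return matrix
```

Vectorising each sentence once, outside the double loop, also avoids recomputing the same `Counter` for every pair.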

Here is the sample data I use to run tests:

Hello my name is Jeff
Hello everyone I’m named Jeff
this has absolutely nothing to do
everyone Im a doctor
hello I don’t even know whats happening
whats  happening is that you not know

These are my current results:

[[0.9999999999999998, 0.33806170189140655, 0.0, 0.0, 0.0, 0.16903085094570328], [0.33806170189140655, 0.9999999999999999, 0.0, 0.1889822365046136, 0.13363062095621217, 1], [0.0, 0.0, 1.0000000000000002, 0.0, 1, 1], [0.0, 0.1889822365046136, 0.0, 1, 1, 1], [0.0, 0.13363062095621217, 1, 1, 1, 1], [0.16903085094570328, 1, 1, 1, 1, 1]]

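These are the expected results for the code above, but I would like to be able to output a set of values that can be plotted on a graph, or to find a way to work out which paragraph each sentence should be assigned to. This is more of a conceptual question than a coding one, although any helpful code is more than welcome.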
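
One possible way to turn a pairwise similarity matrix into paragraph assignments, without a second property or a plot, is to link every pair of sentences whose similarity reaches a threshold and treat each connected group as one paragraph. The sketch below assumes a square, symmetric matrix of cosine scores; the function name `group_by_similarity` and the default threshold of 0.15 are hypothetical and would need tuning on real data.

```python
def group_by_similarity(matrix, threshold=0.15):
    """Return one group label per sentence index.

    Two sentences share a group when a chain of pairs with
    similarity >= threshold connects them (single-link grouping).
    The default threshold is an illustrative guess, not a tuned value.
    """
    n = len(matrix)
    parent = list(range(n))  # union-find forest, one node per sentence

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for x in range(n):
        for y in range(x + 1, n):
            if matrix[x][y] >= threshold:
                parent[find(y)] = find(x)  # merge the two groups

    # Renumber the roots as 0, 1, 2, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]
```

Sentences that receive the same label would go into the same paragraph. If plottable values are still wanted, each row of the matrix can be read as a feature vector for its sentence, and `1 - similarity` can serve as a distance between two sentences.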

0 answers:

No answers