Bag of words tokenizes each letter instead of each word

Time: 2019-09-16 01:40:50

Tags: python string list

I found a bag-of-words implementation online. I am reading a text file that contains many sentences, and those sentences are run through generate_bagOfWords:
import re

import numpy


def stopword_clean(sentence):
    ignore = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself",
              "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself",
              "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these",
              "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do",
              "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while",
              "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before",
              "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again",
              "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each",
              "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than",
              "too", "very", "s", "t", "can", "will", "just", "don", "should", "now", "This", "It", "I"]

    # Replace every non-word character with a space, then split on whitespace
    words = re.sub(r"[^\w]", " ", sentence).split()
    clean_text = [w for w in words if w.lower() not in ignore]
    return clean_text


def tokenize(sentences):
    words = []
    for sentence in sentences:
        x = stopword_clean(sentence)
        words.extend(x)

    words = sorted(list(set(words)))
    return words


def generate_bagOfWords(finalsentences):
    vocab = tokenize(finalsentences)
    print("Word list for document \n{0} \n".format(vocab))

    for sentence in finalsentences:
        words = stopword_clean(sentence)
        bag_vector = numpy.zeros(len(vocab))
        for w in words:
            for i, word in enumerate(vocab):
                if word == w:
                    bag_vector[i] += 1

        print("{0} \n{1}\n".format(sentence, numpy.array(bag_vector)))      

The problem is that I read in the sentences like this:

trainingFile = open(r"D:\Desktop\\1565964985_2925534_train_file.data", "r")

# arrays for the sentiments and reviews
sentiment = []
review = []

# for loop that reads each line
for line in trainingFile:
    # data field array separated by tab
    dataFields = line.split("\t")

    # sentiment holds the positive or negative sentiment of the review
    sentiment.append(dataFields[0])
    # review holds the text from the review
    review.append(dataFields[1])

This gives me entries indexed like this:

Review[0]: This book is such a life saver.  It has been so helpful to be able to go back to track trends, answer pediatrician questions, or communicate with each other when you are up at different times of the night with a newborn.  I think it is one of those things that everyone should be required to have before they leave the hospital.  We went through all the pages of the newborn version, then moved to the infant version, and will finish up the second infant book (third total) right as our baby turns 1.  See other things that are must haves for baby at [...]

The output I get for it is:

['1', 'W', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 'u', 'v', 'w', 'y']
T 
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

h 
[0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

i 
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

s 
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

and so on, letter by letter, for the entire sentence.
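
The per-letter output is consistent with generate_bagOfWords receiving a single string (for example review[0]) rather than a list of strings. The call itself is not shown above, so that is an assumption. A minimal sketch of the Python behavior that would produce this: a for loop over a str visits one character at a time, while a for loop over a list of str visits whole sentences.

```python
# Assumption: the function was called with one string, e.g. review[0],
# instead of a list of strings. "This book" is a shortened stand-in.
single = "This book"
wrapped = ["This book"]

# Iterating over a str yields each character in turn.
chars = [item for item in single]

# Iterating over a list yields each element, here the whole sentence.
sentences = [item for item in wrapped]

print(chars)
print(sentences)
```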

However, if I do this:

    review[0] = ["This book is such a life saver.  It has been so helpful to be able to go back to track trends, answer pediatrician questions, or communicate with each other when you are up at different times of the night with a newborn.  I think it is one of those things that everyone should be required to have before they leave the hospital.  We went through all the pages of the newborn version, then moved to the infant version, and will finish up the second infant book (third total) right as our baby turns 1.  See other things that are must haves for baby at [...]"]

it works correctly:

['1', 'See', 'able', 'answer', 'baby', 'back', 'book', 'communicate', 'different', 'everyone', 'finish', 'go', 'haves', 'helpful', 'hospital', 'infant', 'leave', 'life', 'moved', 'must', 'newborn', 'night', 'one', 'pages', 'pediatrician', 'questions', 'required', 'right', 'saver', 'second', 'things', 'think', 'third', 'times', 'total', 'track', 'trends', 'turns', 'version', 'went']
This book is such a life saver.  It has been so helpful to be able to go back to track trends, answer pediatrician questions, or communicate with each other when you are up at different times of the night with a newborn.  I think it is one of those things that everyone should be required to have before they leave the hospital.  We went through all the pages of the newborn version, then moved to the infant version, and will finish up the second infant book (third total) right as our baby turns 1.  See other things that are must haves for baby at [...] 
[1. 1. 1. 1. 2. 1. 2. 1. 1. 1. 1. 1. 1. 1. 1. 2. 1. 1. 1. 1. 2. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 2. 1. 1. 1. 1. 1. 1. 1. 2. 1.]

What is going on here, and how can I convert the whole array into strings that will work correctly?
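
One possible approach, sketched under the assumption that the call was generate_bagOfWords(review[0]): pass a list instead, either the whole review list or the one-element list [review[0]]. The helper as_sentence_list below is hypothetical (it is not part of the code above) and only normalizes either kind of input to a list of strings.

```python
def as_sentence_list(data):
    """Return a list of sentences for either one string or a list of strings.

    Hypothetical helper for illustration; not part of the original code.
    """
    if isinstance(data, str):
        # A bare string would be iterated character by character, so wrap it.
        return [data]
    return list(data)


# Shortened stand-in for the list built from the training file.
review = ["This book is such a life saver.", "It has been so helpful."]

one = as_sentence_list(review[0])       # single string gets wrapped in a list
all_reviews = as_sentence_list(review)  # a list is passed through as a list

print(one)
print(all_reviews)
```

With something like this, generate_bagOfWords(as_sentence_list(review[0])), or simply generate_bagOfWords(review), should loop over whole sentences rather than individual characters.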

0 answers:

No answers