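I found a bag-of-words implementation online. I am reading a text file that contains many sentences, and those sentences are run through generate_bagOfWords:
import re
import numpy

def stopword_clean(sentence):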
    # Stopwords to drop; the comparison below lowercases each word first
    ignore = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself",
              "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself",
              "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these",
              "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do",
              "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while",
              "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before",
              "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again",
              "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each",
              "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than",
              "too", "very", "s", "t", "can", "will", "just", "don", "should", "now", "This", "It", "I"]
    # Replace every non-word character with a space, then split on whitespace
    words = re.sub(r"[^\w]", " ", sentence).split()
    clean_text = [w for w in words if w.lower() not in ignore]
    return clean_text

def tokenize(sentences):
    words = []
    for sentence in sentences:
        x = stopword_clean(sentence)
        words.extend(x)
    # Sorted list of unique words: the vocabulary
    words = sorted(list(set(words)))
    return words

def generate_bagOfWords(finalsentences):
    vocab = tokenize(finalsentences)
    print("Word list for document \n{0} \n".format(vocab))
    for sentence in finalsentences:
        words = stopword_clean(sentence)
        bag_vector = numpy.zeros(len(vocab))
        # Count how often each vocabulary word occurs in this sentence
        for w in words:
            for i, word in enumerate(vocab):
                if word == w:
                    bag_vector[i] += 1
        print("{0} \n{1}\n".format(sentence, numpy.array(bag_vector)))
The problem is that I read in many sentences like this:
trainingFile = open(r"D:\Desktop\\1565964985_2925534_train_file.data", "r")
# arrays for the sentiments and reviews
sentiment = []
review = []
# for loop that reads each line
for line in trainingFile:
    # data field array separated by tab
    dataFields = line.split("\t")
    # sentiment holds the positive or negative sentiment of the review
    sentiment.append(dataFields[0])
    # review holds the text from the review
    review.append(dataFields[1])
That gives me entries like this:
Review[0]: This book is such a life saver. It has been so helpful to be able to go back to track trends, answer pediatrician questions, or communicate with each other when you are up at different times of the night with a newborn. I think it is one of those things that everyone should be required to have before they leave the hospital. We went through all the pages of the newborn version, then moved to the infant version, and will finish up the second infant book (third total) right as our baby turns 1. See other things that are must haves for baby at [...]
and this is the output I get:
['1', 'W', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 'u', 'v', 'w', 'y']
T
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
h
[0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
i
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
s
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
and so on, one character at a time, for the whole sentence.
However, if I do this:
review[0] = ["This book is such a life saver. It has been so helpful to be able to go back to track trends, answer pediatrician questions, or communicate with each other when you are up at different times of the night with a newborn. I think it is one of those things that everyone should be required to have before they leave the hospital. We went through all the pages of the newborn version, then moved to the infant version, and will finish up the second infant book (third total) right as our baby turns 1. See other things that are must haves for baby at [...]"]
it works correctly:
['1', 'See', 'able', 'answer', 'baby', 'back', 'book', 'communicate', 'different', 'everyone', 'finish', 'go', 'haves', 'helpful', 'hospital', 'infant', 'leave', 'life', 'moved', 'must', 'newborn', 'night', 'one', 'pages', 'pediatrician', 'questions', 'required', 'right', 'saver', 'second', 'things', 'think', 'third', 'times', 'total', 'track', 'trends', 'turns', 'version', 'went']
This book is such a life saver. It has been so helpful to be able to go back to track trends, answer pediatrician questions, or communicate with each other when you are up at different times of the night with a newborn. I think it is one of those things that everyone should be required to have before they leave the hospital. We went through all the pages of the newborn version, then moved to the infant version, and will finish up the second infant book (third total) right as our baby turns 1. See other things that are must haves for baby at [...]
[1. 1. 1. 1. 2. 1. 2. 1. 1. 1. 1. 1. 1. 1. 1. 2. 1. 1. 1. 1. 2. 1. 1. 1.
1. 1. 1. 1. 1. 1. 2. 1. 1. 1. 1. 1. 1. 1. 2. 1.]
What is going on here, and how can I convert the whole array into strings that work properly?
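
A possible explanation, sketched under an assumption: the call itself is not shown above, so this assumes the function was called as generate_bagOfWords(review[0]), i.e. with a single string rather than a list of strings. A for loop over a str visits one character at a time, which would match the character-by-character output. The helper as_sentence_list below is hypothetical and only illustrates the difference; it is not part of the original code.

```python
def as_sentence_list(reviews):
    """Hypothetical helper: always return a list of sentence strings."""
    if isinstance(reviews, str):
        # A lone string is wrapped so that a loop sees one whole sentence
        return [reviews]
    return list(reviews)


# Stand-in data (assumption): shortened versions of the review text
review = ["This book is such a life saver.", "It has been so helpful."]

# Looping over a str yields single characters
chars = [item for item in review[0]]
print(chars[:4])

# Looping over a list of str yields whole sentences
print([item for item in as_sentence_list(review[0])])
print([item for item in as_sentence_list(review)])
```

If that assumption holds, passing the whole list, generate_bagOfWords(review), or a one-element list, generate_bagOfWords([review[0]]), should produce word-level vectors without reassigning review[0].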