Question

I would like help reading lines from a .txt file (treating each line as a separate document) and determining the tf-idf for each tweet.

# -*- coding: utf-8 -*-
from __future__ import division, unicode_literals 
import math
from textblob import TextBlob as tb

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    # check the token list; `word in blob` would also match substrings of the raw text
    return sum(1 for blob in bloblist if word in blob.words)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

document1 = tb("""RT @brides: These are 5 hidden jobs no one one tells about one maids-of-honor one about. You're welcome: jobs http://t.co/qybBewFDre
This brides week on brides twitter: One new brides follower via http://t.co/0NP5Wz70Op""")

document2 = tb("""Python, from the Greek word (πύθων/πύθωνας), is a genus of
nonvenomous pythons[2] found in Africa and Asia. Currently, 7 species are
recognised.[2] A member of this genus, P. reticulatus, is among the longest
snakes known.""")

document3 = tb("""The Colt Python is a .357 Magnum caliber revolver formerly
manufactured by Colt's Manufacturing Company of Hartford, Connecticut.
It is sometimes referred to as a "Combat Magnum".[1] It was first introduced
in 1955, the same year as Smith & Wesson's M29 .44 Magnum. The now discontinued
Colt Python targeted the premium revolver market segment. Some firearm
collectors and writers such as Jeff Cooper, Ian V. Hogg, Chuck Hawks, Leroy
Thompson, Renee Smeets and Martin Dougherty have described the Python as the
finest production revolver ever made.""")

bloblist = [document1, document2, document3]
for i, blob in enumerate(bloblist):
    print("Top words in document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:3]:
        print("Word: {}, TF-IDF: {}".format(word, round(score, 5)))

Answer 1

I'm not sure I understood the question correctly.

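from textblob import TextBlob as tb

file_names = ['file1.txt', 'file2.txt']
# read each file's full contents; `with` closes the file even if reading fails
documents = []
for file_name in file_names:
    with open(file_name) as f:
        documents.append(f.read())
# create one blob per document (a list, so it can be iterated more than once)
bloblist = [tb(document) for document in documents]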

For more information on reading and writing files, see: https://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files
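The snippet above treats each file as one document. If each line of a single file holds one tweet, as the question describes, something like the following might work. This is only a sketch: the `read_documents` helper is made up for illustration, and UTF-8 encoding is an assumption about the input file.

```python
import io

def read_documents(path):
    """Return one document per non-empty line of the file at `path`."""
    # io.open accepts `encoding` on both Python 2 and 3; UTF-8 is an assumption
    with io.open(path, encoding='utf-8') as f:
        # strip surrounding whitespace and skip blank lines
        return [line.strip() for line in f if line.strip()]
```

The result could then be turned into blobs with `bloblist = [tb(doc) for doc in read_documents('tweets.txt')]`, where `tweets.txt` is a placeholder file name, and passed to the loop from the question.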

You can parse the strings read from the file like this:

example_string ="""Twitter feed 1: foo
Twitter feed 2: bar
Twitter feed 3: foobar
"""

import re

# parsing using Python string methods
# splitlines() avoids the trailing empty string that split('\n') produces
for line in example_string.splitlines():
    msg_start_poz = line.find(':') + 1
    tweet_msg = line[msg_start_poz:].strip()
    print(tweet_msg)

# using regular expressions
pattern = re.compile(r'^Twitter feed [0-9]+:(.*?)$', re.MULTILINE)
for tweet in pattern.finditer(example_string):
    print(tweet.group(1).strip())
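Once the tweets are parsed, they could be scored with the same tf and idf formulas the question uses. The sketch below is a stand-in under stated assumptions: it tokenizes with a simple regular expression rather than TextBlob, and the names `tokenize` and `tfidf_per_document` are hypothetical.

```python
from __future__ import division
import math
import re

def tokenize(text):
    # crude lowercase word tokenizer; an assumption, not TextBlob's tokenizer
    return re.findall(r"[a-z0-9']+", text.lower())

def tfidf_per_document(documents):
    """Return one {word: tf-idf} dict per document."""
    token_lists = [tokenize(doc) for doc in documents]
    n_docs = len(token_lists)
    # count how many documents each word appears in
    doc_freq = {}
    for tokens in token_lists:
        for word in set(tokens):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    results = []
    for tokens in token_lists:
        scores = {}
        for word in set(tokens):
            tf = tokens.count(word) / len(tokens)
            # same smoothed idf as in the question: log(N / (1 + df))
            idf = math.log(n_docs / (1 + doc_freq[word]))
            scores[word] = tf * idf
        results.append(scores)
    return results

# the three messages parsed from example_string
tweets = ["foo", "bar", "foobar"]
for i, scores in enumerate(tfidf_per_document(tweets)):
    top = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:3]
    print("Tweet {}: {}".format(i + 1, top))
```

Note that with this idf, a word found in every document gets a negative score, because N / (1 + df) drops below 1.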
