Python script to extract the most common words from a URL

Time: 2019-05-25 09:03:29

Tags: python web-scraping

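This Python script extracts the most common words from a text file. I want to rewrite it so that it does the same thing, but reads its text from this blog: http://teonite.com/blog/

How can I do this?

The script must go through all 7 pages of the blog, open each article, and get the article's text and the author's name.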
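
A minimal sketch of the crawl described above (visit each listing page, follow the article links, pull out the text), using only the standard library. The names `LinkAndTextParser`, `parse_page`, `listing_urls`, `fetch` and `crawl` are made up for illustration. The `page/N/` pagination pattern and the "link starts with the blog URL" filter are assumptions, not something known about the site, so check them against the real pages. Extracting the author's name depends on the blog's markup, which is not shown here, so it is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

BLOG_URL = "http://teonite.com/blog/"  # from the question
PAGE_COUNT = 7                         # from the question


class LinkAndTextParser(HTMLParser):
    """Collects <a href> values and visible text, skipping <script>/<style>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def parse_page(html, base_url):
    """Return (absolute links, visible text) for one HTML document."""
    parser = LinkAndTextParser()
    parser.feed(html)
    links = [urljoin(base_url, href) for href in parser.links]
    return links, " ".join(parser.text_parts)


def listing_urls(blog_url=BLOG_URL, pages=PAGE_COUNT):
    # ASSUMPTION: page 1 is the blog root and later pages live at "page/N/".
    return [blog_url] + [urljoin(blog_url, f"page/{n}/") for n in range(2, pages + 1)]


def fetch(url):
    """Download a URL and decode it as UTF-8 (assumed encoding)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl():
    """Yield (url, text) for every article linked from the listing pages."""
    listing = listing_urls()
    seen = set(listing)  # never treat a listing page as an article
    for page_url in listing:
        links, _ = parse_page(fetch(page_url), page_url)
        for link in links:
            # ASSUMPTION: article URLs start with the blog URL; this is a guess.
            if link.startswith(BLOG_URL) and link not in seen:
                seen.add(link)
                _, text = parse_page(fetch(link), link)
                yield link, text
```

Nothing is downloaded until `crawl()` is iterated, e.g. `for url, text in crawl(): ...`. The text returned this way is the whole visible page (menus and footers included), so in practice the parser would need to be narrowed to the element that holds the article body.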

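import collections
import pandas as pd
import matplotlib.pyplot as plt
# In a Jupyter notebook, uncomment the next line to show plots inline
# (it is notebook magic, not valid syntax in a plain .py script)
# %matplotlib inline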
# Read input file, note the encoding is specified here
# It may be different in your text file
file = open('PrideandPrejudice.txt', encoding="utf8")


a = file.read()

# Stopwords

stopwords = set(line.strip() for line in open('stopwords.txt'))
stopwords = stopwords.union(set(['mr','mrs','one','two','said']))
# Instantiate a dictionary, and for every word in the file,
# add it to the dictionary if it doesn't exist. If it does, increase the count.
wordcount = {}
# To avoid counting the same word twice, lowercase the text
# and strip punctuation from each word.
for word in a.lower().split():
    word = word.replace(".","")
    word = word.replace(",","")
    word = word.replace(":","")
    word = word.replace("\"","")
    word = word.replace("!","")
    word = word.replace("“","")
    word = word.replace("‘","")
    word = word.replace("*","")
    if word not in stopwords:
        if word not in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1
# Print most common word
n_print = int(input("How many most common words to print: "))
print("\nOK. The {} most common words are as follows\n".format(n_print))
word_counter = collections.Counter(wordcount)
for word, count in word_counter.most_common(n_print):
    print(word, ": ", count)
# Close the file
file.close()
# Create a data frame of the most common words
# Draw a bar chart
lst = word_counter.most_common(n_print)
df = pd.DataFrame(lst, columns = ['Word', 'Count'])
df.plot.bar(x='Word',y='Count')

# I also need all the authors' names. This is a loop for
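
One way the counting part could be adapted to scraped posts, as a rough sketch: assume each article has already been reduced to an `(author, text)` pair (how to get the author depends on the blog's markup, which is not shown here). The helper names `tokenize`, `count_words` and `count_by_author` and the sample posts are hypothetical. The regex only keeps ASCII letters and apostrophes, which is a simplification that replaces the chain of `replace` calls.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and return its words (ASCII letters and apostrophes only)."""
    words = re.findall(r"[a-z']+", text.lower())
    # Drop leading/trailing apostrophes and any token that becomes empty
    return [w.strip("'") for w in words if w.strip("'")]


def count_words(text, stopwords=frozenset()):
    """Count every word in text that is not a stopword."""
    return Counter(w for w in tokenize(text) if w not in stopwords)


def count_by_author(posts, stopwords=frozenset()):
    """posts is an iterable of (author, text); returns (total, per-author) counts."""
    total = Counter()
    per_author = defaultdict(Counter)
    for author, text in posts:
        counts = count_words(text, stopwords)
        total += counts
        per_author[author] += counts
    return total, dict(per_author)


# Made-up sample data standing in for scraped articles
posts = [
    ("Alice", "Python is great. Python is fun!"),
    ("Bob", "Scraping with Python: the basics."),
]
total, per_author = count_by_author(posts, stopwords={"is", "the", "with"})
print(total.most_common(1))   # [('python', 3)]
print(sorted(per_author))     # ['Alice', 'Bob']
```

`total.most_common(n)` can be fed to `pd.DataFrame(..., columns=['Word', 'Count'])` exactly as in the script above, and `per_author[name].most_common(n)` gives the same list for a single author.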

0 answers:

No answers