Question

I wrote some code that uses the reddit PRAW API to find the most popular words in submission titles on reddit.

import nltk
import praw

picksub = raw_input('\nWhich subreddit do you want to analyze? r/')
many = input('\nHow many of the top words would you like to see? \n\t> ')

print 'Getting the top %d most common words from r/%s:' % (many,picksub)
r = praw.Reddit(user_agent='get the most common words from chosen subreddit')
submissions = r.get_subreddit(picksub).get_top_from_all(limit=200)

hey = []

for x in submissions:
    hey.extend(str(x).split(' '))   

fdist = nltk.FreqDist(hey) # creates a frequency distribution for words in 'hey'
top_words = fdist.keys()

common_words = ['its','am', 'ago','took', 'got', 'will', 'been', 'get', 'such','your','don\'t', 'if', 'why', 'do', 'does', 'or', 'any', 'but', 'they', 'all', 'now','than','into','can', 'i\'m','not','so','just', 'out','about','have','when', 'would' ,'where', 'what', 'who' 'I\'m','says' 'not', '', 'over', '_', '-','after', 'an','for', 'who', 'by', 'from', 'it', 'how', 'you', 'about' 'for', 'on', 'as', 'be', 'has', 'that', 'was', 'there', 'with','what', 'we', '::', 'to', 'the', 'of', ':', '...', 'a', 'at', 'is', 'my', 'in' , 'i', 'this', 'and', 'are', 'he', 'she', 'is', 'his', 'hers']
already = []
counter = 0
number = 1

print '-----------------------'
for word in top_words:  
    if word.lower() not in common_words and word.lower() not in already:
        print str(number) + ". '" + word + "'"
        counter +=1
    number +=1
    already.append(word.lower())
if counter == many:
    break
print '-----------------------\n'

So entering the subreddit 'python' and asking for 10 words returns:

1. 'Python'
2. 'PyPy'
3. 'code'
4. 'use'
5. '136'
6. '181'
7. 'd...'
8. 'IPython'
9. '133'
10. '158'

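How can I make this script not return numbers and junk words like 'd...'? The first 4 results are acceptable, but I would like to replace the rest with meaningful words. Making the common_words list is unreasonable, and it doesn't filter out these errors anyway. I'm relatively new to writing code, and I would appreciate the help.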

Answer 1

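I disagree. Making a list of common words is correct; there is no easier way to filter out words such as the, for, I, am, etc. However, it is unreasonable to use the common_words list to filter out results that aren't words, because then you'd have to include every possible non-word you don't want. Non-words should be filtered out differently.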

Some suggestions:
1) common_words should be a set(). Since your list is long, this should speed things up: the in operation on a set is O(1), while on a list it is O(n).
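As a minimal sketch of suggestion 1, the stop list can be stored as a set and checked through a small helper. The helper name is_common is hypothetical, and only a handful of the question's entries are shown. The sketch also illustrates a side issue in the question's list: a few entries are missing commas (for example 'who' 'I\'m'), and Python silently joins adjacent string literals into one string.

```python
# A few entries from the question's list, stored as a set so that
# membership tests are O(1) on average instead of O(n).
common_words = {'the', 'of', 'and', "i'm", 'who', 'for', 'about'}


def is_common(word):
    """Return True if the lower-cased word is in the stop list (hypothetical helper)."""
    return word.lower() in common_words


# Adjacent string literals with no comma between them are concatenated,
# so this list has ONE element, not two.
joined = ['who' "I'm"]

print(is_common('The'))
print(is_common('Python'))
print(joined)
```

If the existing list is kept as is, wrapping it once with set(common_words) gives the same lookup behaviour without retyping the entries.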

2) Getting rid of all-digit strings is trivial. One way you can do it is:

all([w.isdigit() for w in word])

If this returns True, then the word is just a sequence of digits.
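A short sketch of the digit check applied to some of the words from the question's output. The function name is_number is hypothetical. It uses str.isdigit() on the whole word, which behaves like the all(...) expression above except for the empty string: all() over an empty list is True, while ''.isdigit() is False.

```python
def is_number(word):
    """Return True if word is non-empty and consists only of digits."""
    return word.isdigit()


# Sample words taken from the output shown in the question.
words = ['Python', '136', 'd...', '181', 'IPython']

# Keep only the words that are not plain numbers.
kept = [w for w in words if not is_number(w)]
print(kept)
```

Note that 'd...' still gets through this filter, which is why the next suggestion is needed.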

3) Getting rid of d... is a bit trickier. It depends on how you define a non-word. This:

tf = [ c.isalpha() for c in word ]

returns a list of True/False values (False wherever a character is not a letter). You can then count the values like so:

t = tf.count(True)
f = tf.count(False)

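You can then define a non-word as a word that contains more non-letter characters than letters, as a word that contains any non-letter character at all, and so on. For example: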

def check_wordiness(word):
    # This returns true only if a word is all letters
    return all([ c.isalpha() for c in word ])
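The other definition mentioned above, treating a token as a non-word when it has more non-letter characters than letters, could be sketched as follows. The function name looks_like_word and the choice to reject the empty string are assumptions made for illustration.

```python
def looks_like_word(word):
    """Return True unless word has more non-letter characters than letters.

    The empty string is treated as a non-word (an assumption of this sketch).
    """
    if not word:
        return False
    letters = sum(1 for c in word if c.isalpha())
    non_letters = len(word) - letters
    return letters >= non_letters


for token in ['Python', "don't", 'd...', '136', '::']:
    print(token, looks_like_word(token))
```

This looser rule keeps contractions such as don't, which the all-letters check above would reject, while still dropping d..., plain numbers, and punctuation-only tokens.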

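4) In the for word in top_words: block, are you sure you haven't mixed up counter and number? Also, counter and number are pretty much redundant; you could rewrite the last bit as: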

for word in top_words:
    # Since you are calling .lower() so much, 
    # you probably want to define it up here
    w = word.lower() 
    if w not in common_words and w not in already:
        # String formatting is preferred over +'s
        print "%i. '%s'" % (number, word)
        number +=1
    # This could go under the if statement. You only want to add
    # words that could be added again.  Why add words that are being
    # filtered out anyways?
    already.append(w)

    # this wasn't indented correctly before; number starts at 1 and is
    # bumped after each print, so stop once it has passed many
    if number > many:
        break
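Putting the suggestions together, here is a rough end-to-end sketch. To stay runnable without network access it makes several assumptions: the titles list is made-up sample data standing in for what praw returns, collections.Counter from the standard library is swapped in for nltk.FreqDist, the stop list is shortened, and the function name top_words is hypothetical. Tokens are lower-cased before counting, which merges case variants and removes the need for the already list.

```python
from collections import Counter

# Made-up titles standing in for the submissions praw would return.
titles = [
    "Python 3 is the future of Python",
    "PyPy 2.0 released",
    "Why I use Python for code review",
    "136 ways to write code in Python",
]

# Shortened stop list, stored as a set (suggestion 1).
common_words = {'is', 'the', 'of', 'why', 'i', 'for', 'to', 'in'}


def top_words(titles, stop_words, many):
    """Return the `many` most frequent (word, count) pairs in the titles."""
    counts = Counter()
    for title in titles:
        for word in title.split():
            w = word.lower()
            # Skip stop words and anything that is not all letters
            # (this drops numbers and tokens like 'd...').
            if w in stop_words or not w.isalpha():
                continue
            counts[w] += 1
    return counts.most_common(many)


for number, (word, count) in enumerate(top_words(titles, common_words, 3), 1):
    print("%d. '%s' (%d)" % (number, word, count))
```

The all-letters test is the strictest of the definitions discussed in suggestion 3; a looser rule can be substituted at the marked line if contractions should be kept.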

Hope that helps.
