Un-merging all hashtags faster

Time: 2017-12-02 15:49:56

Tags: python twitter hashtag

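I want to un-merge the hashtags in a Twitter dataset, that is, split each one into its component words. For example, "#sunnyday" should become "sunny day".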

I found the following code. It finds the hashtags and looks them up in a file called "wordlist.txt", a huge text file containing many words to match against.

The .txt file can be downloaded here: http://www-personal.umich.edu/~jlawler/wordlist

Source: Term split by hashtag of multiple words

I modified it slightly to make sure it also works when the sentence is empty: " "

# Returns a list of common english terms (words)
def initialize_words():
    content = None
    with open('wordlist.txt') as f: # A file containing common english words
        content = f.readlines()
    return [word.rstrip('\n') for word in content]


def parse_sentence(sentence, wordlist):
    new_sentence = "" # output 
    # MODIFICATION: If the sentence is not empty
    if sentence != '':   
        terms = sentence.split(' ')
        for term in terms:
            # MODIFICATION: If the term is not empty
            if term != '':
                if term[0] == '#': # this is a hashtag, parse it
                    new_sentence += parse_tag(term, wordlist)
                else: # Just append the word
                    new_sentence += term
                new_sentence += " "

    return new_sentence 


def parse_tag(term, wordlist):
    words = []
    # Remove hashtag, split by dash
    tags = term[1:].split('-')
    for tag in tags:
        word = find_word(tag, wordlist)    
        while word is not None and len(tag) > 0:
            words.append(word)            
            if len(tag) == len(word): # Special case for when eating rest of word
                break
            tag = tag[len(word):]
            word = find_word(tag, wordlist)
    return " ".join(words)


def find_word(token, wordlist):
    i = len(token) + 1
    while i > 1:
        i -= 1
        if token[:i] in wordlist:
            return token[:i]
    return None 

The problem is that it takes forever to run! How can I make it faster?

1 answer:

Answer 0: (score: 0)

Use a set instead of a list for the variable wordlist.

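This will be a huge performance improvement. With a list, each membership test (token[:i] in wordlist) may have to scan the entire word list, so it is O(n). With a set it is O(1) on average, because membership is checked by computing the item's hash and using it as an index into the backing store. Since find_word runs this test once for every prefix of every hashtag, the per-lookup cost dominates the total running time.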
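As a minimal sketch of that change, assuming the rest of the question's code stays as it is: initialize_words builds a set rather than a list, and find_word is unchanged in behavior (longest matching prefix first). The path parameter and its default are added here only for illustration and are not part of the original code.

```python
def initialize_words(path='wordlist.txt'):
    """Load the word list into a set so that `in` is an average O(1) hash lookup.

    `path` is a hypothetical parameter added for this sketch; the original
    code hard-codes 'wordlist.txt'.
    """
    with open(path) as f:
        # Strip surrounding whitespace and skip blank lines
        return {line.strip() for line in f if line.strip()}


def find_word(token, wordlist):
    """Return the longest prefix of `token` found in `wordlist`, or None."""
    # Try the full token first, then progressively shorter prefixes
    for i in range(len(token), 0, -1):
        if token[:i] in wordlist:  # hash lookup when wordlist is a set
            return token[:i]
    return None
```

parse_sentence and parse_tag need no changes, because they only pass wordlist through to find_word. If the word list has already been loaded as a list elsewhere, wrapping it once with set(wordlist) before parsing should have the same effect.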