I want to un-merge hashtags in a Twitter dataset. For example, "#sunnyday" should become "sunny day".
I found the following code: it finds hashtags and looks them up in a file called "wordlist.txt", a huge txt file containing many words to match against.
The txt file can be downloaded here: http://www-personal.umich.edu/~jlawler/wordlist
# Returns a list of common english terms (words)
def initialize_words():
    content = None
    with open('wordlist.txt') as f:  # A file containing common english words
        content = f.readlines()
    return [word.rstrip('\n') for word in content]

def parse_sentence(sentence, wordlist):
    new_sentence = ""  # output
    # MODIFICATION: If the sentence is not empty
    if sentence != '':
        terms = sentence.split(' ')
        for term in terms:
            # MODIFICATION: If the term is not empty
            if term != '':
                if term[0] == '#':  # this is a hashtag, parse it
                    new_sentence += parse_tag(term, wordlist)
                else:  # Just append the word
                    new_sentence += term
                new_sentence += " "
    return new_sentence

def parse_tag(term, wordlist):
    words = []
    # Remove hashtag, split by dash
    tags = term[1:].split('-')
    for tag in tags:
        word = find_word(tag, wordlist)
        while word is not None and len(tag) > 0:
            words.append(word)
            if len(tag) == len(word):  # Special case for when eating rest of word
                break
            tag = tag[len(word):]
            word = find_word(tag, wordlist)
    return " ".join(words)

def find_word(token, wordlist):
    i = len(token) + 1
    while i > 1:
        i -= 1
        if token[:i] in wordlist:
            return token[:i]
    return None
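To see what the matching logic does, here is a self-contained walk-through of parse_tag with a tiny in-memory wordlist (the three words below are assumptions for the example, not taken from wordlist.txt): find_word greedily takes the longest dictionary prefix, parse_tag then repeats on the remainder.

```python
# Minimal reproduction of the matching logic above, with an
# illustrative three-word list standing in for wordlist.txt.
def find_word(token, wordlist):
    # Try the longest prefix of `token` first, then shrink it.
    i = len(token) + 1
    while i > 1:
        i -= 1
        if token[:i] in wordlist:
            return token[:i]
    return None

def parse_tag(term, wordlist):
    words = []
    tags = term[1:].split('-')  # strip the '#', split on dashes
    for tag in tags:
        word = find_word(tag, wordlist)
        while word is not None and len(tag) > 0:
            words.append(word)
            if len(tag) == len(word):  # consumed the whole remainder
                break
            tag = tag[len(word):]
            word = find_word(tag, wordlist)
    return " ".join(words)

wordlist = ["sunny", "day", "sun"]
print(parse_tag("#sunnyday", wordlist))  # -> "sunny day"
```

Note that "sunny" wins over "sun" because find_word tries the longest prefix first; every `token[:i] in wordlist` check scans the whole list, which is why this gets slow on a large wordlist.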
The problem is that it takes forever to run! How can I make it faster?
Answer 0: (score: 0)
Use a set instead of a list for the wordlist variable. This will be a huge performance improvement: with a list, a membership test may have to scan the entire word list, so it is O(n); with a set it is O(1) on average, because membership is checked by computing the item's hash and using it as an index into the backing store.