Question

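I have a list of about 5,000 unique words/tokens, one per line (a smiley counts as one word). I am trying to generate input that will work with an SVM in Python.

Imagine the example list contains only a few words: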

happy
sad
is
:(
i
the
day
am
today
:)

My strings are:

tweets = ['i am happy today :)', 'is today the sad day :(']

The output for each tweet should then be:

5:1 8:1 1:1 9:1 10:1
3:1 9:1 6:1 2:1 7:1 4:1

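Note the format: the first number, before the colon, refers to the word by its line number/position in the list. For example, ':)' is the tenth word in the list (a text file with one token per line).

I was thinking of writing a function that reads the text file and puts each line (each word/token) into a position in a list or dictionary, so that I can read each word from a tweet and convert it to a number based on its position in the list.

Does anyone know how to do this in Python? After that, I was thinking of something like this: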

for tweet in tweets:
    <translate-words-into-list-position>

Answer 1

words = ['happy', 'sad', 'is', ':(', 'i', 'the', 'day', 'am', 'today', ':)']
# Map each word to its 1-based position in the list.
d = {w: i for i, w in enumerate(words, start=1)}
tweets = ['i am happy today :)', 'is today the sad day :(']
for tweet in tweets:
    # Words that are not in the dictionary are skipped.
    print(' '.join('{0}:1'.format(d[w]) for w in tweet.split() if w in d))


5:1 8:1 1:1 9:1 10:1
3:1 9:1 6:1 2:1 7:1 4:1

If the words are in a file, you can still use this solution; just remember to .rstrip('\n') each line. For example:

# Text mode already handles universal newlines; the old 'rU' mode was removed in Python 3.11.
with open('words.txt') as f:
    d = {w.rstrip('\n'): i for i, w in enumerate(f, start=1)}
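
Putting the two pieces together, the whole pipeline might look like the sketch below. The helper names `load_vocab` and `encode` are made up for illustration, and the word file is assumed to be UTF-8 with one token per line. The sample list is written to a temporary file only so the example runs on its own.

```python
import os
import tempfile


def load_vocab(path):
    """Map each token to its 1-based line number (hypothetical helper)."""
    with open(path, encoding='utf-8') as f:
        return {line.rstrip('\n'): i for i, line in enumerate(f, start=1)}


def encode(tweet, vocab):
    """Turn a tweet into 'position:1' pairs, skipping unknown tokens."""
    return ' '.join('{0}:1'.format(vocab[w]) for w in tweet.split() if w in vocab)


# Demo only: write the sample word list to a temporary file.
words = ['happy', 'sad', 'is', ':(', 'i', 'the', 'day', 'am', 'today', ':)']
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('\n'.join(words) + '\n')

vocab = load_vocab(tmp.name)
os.remove(tmp.name)

tweets = ['i am happy today :)', 'is today the sad day :(']
encoded = [encode(t, vocab) for t in tweets]
for line in encoded:
    print(line)
```

With a real word list, replace the temporary file with the path to your own file and call `load_vocab` once before encoding the tweets.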

Answer 2

>>> from itertools import count
>>> words = ['happy', 'sad', 'is', ':(', 'i', 'the', 'day', 'am', 'today', ':)']
>>> D = dict(zip(words, count(1)))
>>> tweets =['i am happy today :)','is today the sad day :(']
>>> [["{}:1".format(D[k]) for k in t.split() if k in D] for t in tweets]
[['5:1', '8:1', '1:1', '9:1', '10:1'], ['3:1', '9:1', '6:1', '2:1', '7:1', '4:1']]
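
Both answers emit the pairs in tweet order and write `:1` even when a word repeats. If the target is the LIBSVM or SVMlight sparse text format (an assumption; the question only says "SVM for Python"), those tools generally expect a label at the start of each line and feature indices in ascending order. A rough sketch under that assumption follows; `to_sparse_line` is a hypothetical helper and the `+1`/`-1` labels are invented for the example.

```python
from collections import Counter

words = ['happy', 'sad', 'is', ':(', 'i', 'the', 'day', 'am', 'today', ':)']
vocab = {w: i for i, w in enumerate(words, start=1)}


def to_sparse_line(tweet, vocab, label=None):
    """Build 'index:count' pairs sorted by index (hypothetical helper)."""
    # Count how often each known token occurs, keyed by its list position.
    counts = Counter(vocab[w] for w in tweet.split() if w in vocab)
    pairs = ['{0}:{1}'.format(i, c) for i, c in sorted(counts.items())]
    if label is not None:
        # The label value is an assumption; use whatever your data defines.
        pairs.insert(0, str(label))
    return ' '.join(pairs)


lines = [
    to_sparse_line('i am happy today :)', vocab, label='+1'),
    to_sparse_line('is today the sad day :(', vocab, label='-1'),
    to_sparse_line('happy happy day', vocab),
]
for line in lines:
    print(line)
```

Whether counts or plain 0/1 indicators work better depends on the classifier and the data, so treat the counting step as optional.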

Converting strings to positions in a token list
