Question

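I'm trying to do sentiment analysis based on a dictionary of 7000 words. The code works in Python, but it counts every partial match instead of only distinct, whole words.

For example, the dictionary contains enter and the text contains enterprise. How can I change the code so that this is not treated as a match?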

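# Python 2 code: it relies on string.split() and the print statement
import re
import string
import sys

dictfile = sys.argv[1]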
textfile = sys.argv[2]

a = open(textfile)
text = string.split( a.read() )
a.close()

a = open(dictfile)
lines = a.readlines()
a.close()

dic = {}
scores = {}

current_category = "Default"
scores[current_category] = 0

for line in lines:
   if line[0:2] == '>>':
       current_category = string.strip( line[2:] )
       scores[current_category] = 0
   else:
       line = line.strip()
       if len(line) > 0:
           pattern = re.compile(line, re.IGNORECASE)
           dic[pattern] = current_category

for token in text:
   for pattern in dic.keys():
       if pattern.match( token ):
           categ = dic[pattern]
           scores[categ] = scores[categ] + 1

for key in scores.keys():
   print key, ":", scores[key]

Answer 1

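.match() only anchors at the beginning of the string, so the pattern enter still matches enterprise. You can add an end-of-string anchor to your regex:

re.compile(line + '$')

Or you could use word boundaries (note the raw strings: in a plain string literal '\b' is a backspace character, not a word boundary):

re.compile(r'\b' + line + r'\b')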
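
To see the difference between the patterns, here is a small sketch. The sample tokens and the helper name compile_keyword are made up for illustration, and re.escape is an extra precaution that is not part of the answer above: it assumes the dictionary entries are literal words rather than regexes.

```python
import re

def compile_keyword(keyword):
    """Compile a dictionary entry so that it only matches as a whole word.

    Hypothetical helper: re.escape assumes entries are literal words.
    """
    return re.compile(r'\b' + re.escape(keyword) + r'\b', re.IGNORECASE)

prefix_only = re.compile('enter', re.IGNORECASE)   # the question's pattern
whole_word = compile_keyword('enter')

for token in ['enter', 'Enter', 'enterprise']:
    # Columns: token, prefix match found, whole-word match found
    print(token, bool(prefix_only.match(token)), bool(whole_word.match(token)))
```

Only the prefix pattern accepts enterprise. One difference between the two fixes: the word-boundary version still accepts a token with trailing punctuation such as 'enter,', while the '$' version rejects it.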

Answer 2

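Your indentation is inconsistent. Some levels use 3 spaces, some use 4.
You try to match every word in the text against all 7000 patterns in your dictionary. Instead, just look the word up in the dictionary. If it is not there, ignore the error (EAFP principle).
Also, I'm not sure that the string module function (string.split()) has any advantage over the string method ("".split()). The module-level functions no longer exist in Python 3.
Python also has defaultdict, which initializes missing keys on its own; defaultdict(int) starts every new key at 0.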

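Edit:

Instead of .readlines() I use .read() followed by .split('\n'). This gets rid of the newline characters.
Splitting the text on the regex '\W+' (any run of characters that are not "word characters") instead of on whitespace is my attempt to get rid of punctuation.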
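
As a quick illustration of the split on '\W+' (the sample sentence is made up): punctuation disappears, but a trailing separator leaves an empty string at the end of the list. That is harmless for a dictionary lookup as long as the empty string is not itself a key.

```python
import re

sample = "Enter the enterprise, now!"   # made-up sample text

# Split on runs of non-word characters instead of on whitespace only
words = re.split(r'\W+', sample)
print(words)            # ['Enter', 'the', 'enterprise', 'now', '']

# Whitespace splitting keeps punctuation attached to the words
print(sample.split())   # ['Enter', 'the', 'enterprise,', 'now!']
```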

Below is my proposed code:

import re
import sys
from collections import defaultdict

dictfile = sys.argv[1]
textfile = sys.argv[2]

with open(textfile) as f:
    text = f.read()

with open(dictfile) as f:
    lines = f.read()

categories = {}
scores = defaultdict(int)

current_category = "Default"
scores[current_category] = 0

for line in lines.split('\n'):
    if line.startswith('>>'):
        # Drop the leading '>>' marker and any surrounding whitespace
        current_category = line[2:].strip()
    else:
        keyword = line.strip()
        if keyword:
            categories[keyword] = current_category

# Raw string, so that \W reaches the regex engine unchanged
for word in re.split(r'\W+', text):
    try:
        scores[categories[word]] += 1
    except KeyError:
        # word is not in the dictionary
        pass

for keyword in scores.keys():
    print("{}: {}".format(keyword, scores[keyword]))
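
The lookup above compares words case-sensitively, whereas the question's patterns used re.IGNORECASE. One possible way to keep that behaviour with a plain dictionary lookup is to lowercase both the keywords and the words from the text. The sketch below assumes that approach; the function names load_categories and score_text and the sample data are hypothetical.

```python
import re
from collections import defaultdict

def load_categories(lines):
    """Map each lowercased keyword to its category.

    Lines starting with '>>' open a new category, as in the dictionary file.
    """
    categories = {}
    current_category = "Default"
    for line in lines:
        if line.startswith('>>'):
            current_category = line[2:].strip()
        else:
            keyword = line.strip().lower()
            if keyword:
                categories[keyword] = current_category
    return categories

def score_text(text, categories):
    """Count whole-word hits per category, ignoring case."""
    scores = defaultdict(int)
    for word in re.split(r'\W+', text.lower()):
        try:
            scores[categories[word]] += 1
        except KeyError:
            pass  # word is not in the dictionary
    return scores

# Made-up sample data
categories = load_categories([">> Positive", "enter", "good", ">> Negative", "bad"])
scores = score_text("Enter the enterprise: good, not bad.", categories)
print(dict(scores))   # {'Positive': 2, 'Negative': 1}
```

Here enterprise is simply absent from the dictionary, so it is skipped, while Enter is counted once it has been lowercased.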

Sentiment analysis on distinct words
