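I want to find the total number of positive and negative words matched in a given text. I have a list of positive words in positive.txt and a list of negative words in negative.txt. If a word matches the positive word list, I want a simple integer variable to be incremented by 1, and likewise for words that match the negative list. The code below gives me the text under @class="story-hed". This is the text I want to compare against the positive and negative word lists, and for which I also want the total word count. My code is: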
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from dawn.items import DawnItem

class dawnSpider(BaseSpider):
    name = "dawn"
    allowed_domains = ["dawn.com"]
    start_urls = [
        "http://dawn.com/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//h3[@class="story-hed"]//a/text()').extract()
        items = []
        for site in sites:
            item = DawnItem()
            item['title'] = site
            items.append(item)
        return items
Answer 0 (score: 4)
The following standalone code solves the problem:
from collections import Counter

def readwords(filename):
    # One word per line; the with-block closes the file when done
    with open(filename) as f:
        words = [line.rstrip() for line in f.readlines()]
    return words

positive = readwords('positive.txt')
negative = readwords('negative.txt')

paragraph = 'this is really bad and in fact awesome. really awesome.'

count = Counter(paragraph.split())

pos = 0
neg = 0
for key, val in count.iteritems():
    key = key.rstrip('.,?!\n')  # removing possible punctuation signs
    if key in positive:
        pos += val
    if key in negative:
        neg += val

print pos, neg
Here is what I have in the two input files:
positive.txt:
good
awesome
negative.txt:
bad
ugly
and the output is: 2 1
To implement this in Scrapy, you will probably want to use an item pipeline: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
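As a rough idea of what such a pipeline could look like, here is a minimal sketch in Python 3. The class name SentimentCountPipeline and the item fields 'positive', 'negative' and 'total' are made up for illustration; they are not in the question's DawnItem and would have to be declared there. The only Scrapy convention relied on is that a pipeline is a plain class with a process_item(self, item, spider) method. The word lists are passed to the constructor only to keep the sketch self-contained; in a real project they would more likely be loaded from positive.txt and negative.txt.

```python
import string
from collections import Counter


class SentimentCountPipeline:
    """Hypothetical pipeline: counts positive/negative words in item['title']."""

    def __init__(self, positive=(), negative=()):
        # Assumption: word lists are handed in directly. Scrapy itself creates
        # pipelines without arguments, so a real project would load the files here.
        self.positive = set(positive)
        self.negative = set(negative)

    def process_item(self, item, spider):
        # Split on whitespace, strip surrounding punctuation, ignore case
        words = [w.strip(string.punctuation).lower() for w in item['title'].split()]
        counts = Counter(words)
        # 'positive', 'negative' and 'total' are illustrative field names
        item['positive'] = sum(counts[w] for w in self.positive)
        item['negative'] = sum(counts[w] for w in self.negative)
        item['total'] = sum(1 for w in words if w)
        return item


pipeline = SentimentCountPipeline(['good', 'awesome'], ['bad', 'ugly'])
sample = {'title': 'This is really bad and in fact awesome. Really awesome.'}
result = pipeline.process_item(sample, spider=None)
print(result['positive'], result['negative'], result['total'])
```

Using sets for the word lists keeps each lookup cheap even when the lists grow large.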
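Answer 1 (score: 0)
First, you will want to read the files. Assuming there is one word per line, you can read all the words with the following code:
positive = [l.strip() for l in open("positive.txt")]
Once that is done, you can create a dict that holds the words as keys and the counts as values. To initialize the dictionary with zeros, you can use:
positive_count = dict.fromkeys(positive, 0)
Next, you have to iterate over all the items and increment the count whenever the word is found:
for item in items:
    if item in positive_count:
        positive_count[item] += 1
Finally, you can print the results:
for item, value in positive_count.iteritems():
    print "Word %s count %d" % (item, value)
The negative case is the same; it is omitted only to keep the answer simple.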
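Putting those steps together for both lists, a small Python 3 sketch might look like this. The helper name count_matches and the sample sentence are invented for the example, and the word lists are written inline instead of being read from the two files so that the snippet runs on its own.

```python
def count_matches(words, vocabulary):
    """Return a dict mapping each vocabulary word to its number of occurrences."""
    counts = dict.fromkeys(vocabulary, 0)
    for word in words:
        if word in counts:
            counts[word] += 1
    return counts


# Inline stand-ins for the contents of positive.txt and negative.txt
positive = ["good", "awesome"]
negative = ["bad", "ugly"]

text = "good food, bad service, but good prices"
# Strip trailing/leading punctuation so "food," matches "food"
words = [w.strip('.,?!') for w in text.split()]

positive_count = count_matches(words, positive)
negative_count = count_matches(words, negative)

for word, value in sorted(positive_count.items()):
    print("Word %s count %d" % (word, value))
for word, value in sorted(negative_count.items()):
    print("Word %s count %d" % (word, value))
```

Summing the values of each dict gives the overall positive and negative totals asked for in the question.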
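Answer 2 (score: 0)
It depends on the size of the word lists. If they are small (less than a few KB), just read them into a list:
with open(positive_wordlist_file_name) as fd:
    positive_words = [line.strip() for line in fd]
Once you have the two word lists, you can run through the text with them, line by line if possible. Split each line into words, then check each word against the lists with the "in" operator. I would use a couple of coroutines in a class: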
class WordCounter:
    def __init__(self, positive_words, negative_words):
        # Word lists, read as shown above
        self.positive_words = positive_words
        self.negative_words = negative_words
        self.positive_count = 0
        self.negative_count = 0

    def positive_word_counter(self):
        """Coroutine that counts positive words. I'll leave it to the reader
        to write a similar negative_word_counter."""
        while True:
            words = yield
            matched = [word for word in words if word in self.positive_words]
            self.positive_count += len(matched)

    def read_text(self, text):
        """text: some iterable of lines, e.g. a file handle or a list."""
        # Expand on this split with other word separators, or use re.split
        # with a word boundary instead
        line_words = (line.strip().split() for line in text)
        # Create and prime the coroutines
        positive_counter = self.positive_word_counter()
        next(positive_counter)
        negative_counter = self.negative_word_counter()
        next(negative_counter)
        # Now feed the words in; send() passes each list to the waiting yield
        for words in line_words:
            positive_counter.send(words)
            negative_counter.send(words)
        # The totals can now be read from self.positive_count and self.negative_count
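For readers who want something they can run directly, here is one possible complete version of the coroutine idea in Python 3. It is a sketch, not the answer's exact design: a single generic coroutine (the hypothetical _counter method) handles both lists, and the tokenizing regular expression is an assumption about what counts as a word.

```python
import re


class WordCounter:
    def __init__(self, positive_words, negative_words):
        self.vocab = {"positive": set(positive_words),
                      "negative": set(negative_words)}
        self.counts = {"positive": 0, "negative": 0}

    def _counter(self, label):
        """Coroutine: receives a list of words via send() and tallies matches."""
        vocab = self.vocab[label]
        while True:
            words = yield
            self.counts[label] += sum(1 for word in words if word in vocab)

    def read_text(self, lines):
        """lines: any iterable of strings, e.g. a file handle or a list."""
        counters = [self._counter(label) for label in ("positive", "negative")]
        for counter in counters:
            next(counter)  # prime: advance to the first yield
        for line in lines:
            # Assumption: a word is a run of letters or apostrophes
            words = re.findall(r"[A-Za-z']+", line.lower())
            for counter in counters:
                counter.send(words)
        for counter in counters:
            counter.close()
        return self.counts


wc = WordCounter(["good", "awesome"], ["bad", "ugly"])
totals = wc.read_text(["This is really bad and in fact awesome.",
                       "Really awesome."])
print(totals["positive"], totals["negative"])
```

A coroutine must be advanced once with next() before the first send(); sending a value to a generator that has not started raises a TypeError.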
Answer 3 (score: 0)
for key, val in count.iteritems():
==> This only works in Python versions below 3. If you are using Python 3 or above, use items() instead:
for key, val in count.items():
    key = key.rstrip('.,?!\n')  # removing possible punctuation signs
    if key in positive:
        pos += val
    if key in negative:
        neg += val
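For completeness, the counting loop from the accepted answer could be written for Python 3 roughly as follows. The word lists are given inline as sets, standing in for the contents of positive.txt and negative.txt, so that the snippet is runnable without the files.

```python
from collections import Counter

# Inline stand-ins for the contents of positive.txt and negative.txt
positive = {"good", "awesome"}
negative = {"bad", "ugly"}

paragraph = 'this is really bad and in fact awesome. really awesome.'
count = Counter(paragraph.split())

pos = 0
neg = 0
for key, val in count.items():  # items() replaces Python 2's iteritems()
    key = key.rstrip('.,?!\n')  # removing possible punctuation signs
    if key in positive:
        pos += val
    if key in negative:
        neg += val

print(pos, neg)  # prints: 2 1
```

Besides items(), the other change needed for Python 3 is that print is a function and therefore takes parentheses.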