I want to know whether it is feasible to analyze news headlines in real time / from a stream (using NLTK / VADER sentiment).
Below is the code that provides the news feed (headlines):
import praw
import time

reddit = praw.Reddit(client_id='xxxx',
                     client_secret='MLK5gKaEM2FxxxxxxxxI',
                     user_agent='testing_api')
# must be edited to properly authenticate
subreddit = reddit.subreddit('worldnews')

seen_submissions = set()
while True:
    for submission in subreddit.new(limit=10):
        if submission.fullname not in seen_submissions:
            seen_submissions.add(submission.fullname)
            print('{} {}\n'.format(submission.title, submission.url))
    time.sleep(60)  # sleep for a minute (60 seconds)
Building on that with SentimentIntensityAnalyzer:
from IPython import display
import math
import time
from pprint import pprint
import pandas as pd
import numpy as np
import nltk
nltk.download('vader_lexicon')
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid', context='talk', palette='Dark2')
import praw

reddit = praw.Reddit(client_id='xxxx',
                     client_secret='MLK5gKaEM2FxxxxxxxxI',
                     user_agent='testing_api')
subreddit = reddit.subreddit('worldnews')

headlines = set()
while True:
    for submission in subreddit.new(limit=10):
        if submission.title not in headlines:
            headlines.add(submission.title)
    time.sleep(60)  # sleep for a minute (60 seconds)

from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA

sia = SIA()
results = []
for line in headlines:
    pol_score = sia.polarity_scores(line)
    pol_score['headline'] = line
    results.append(pol_score)

pprint(results, width=100)
I don't see anything printed to the console... I would expect to see something like this (in real time):
[{'compound': -0.5267,
  'headline': 'Report: Nearly Half of Americans Breathing Unhealthy Air',
  'neg': 0.327,
  'neu': 0.673,
  'pos': 0.0},
 {'compound': -0.0754,
  'headline': 'The Implications of Trump Derangement Syndrome | Even now, vehement Trump '
              'supporters seem to believe that most criticism of the president is explained by '
              'widespread TDS.',
  'neg': 0.11,
  'neu': 0.791,
  'pos': 0.1}]
Answer 0 (score: 0)
It looks like you haven't posted a complete example. You still need to call polarity_scores()
and add its result to a data structure.
For example, if you want to use a dictionary:
reddit = praw.Reddit( ... )
sub = reddit.subreddit('worldnews')
analyzer = SentimentIntensityAnalyzer()

results = {}
posts = sub.new(limit=10)
for post in posts:
    title = post.title
    if title in results:
        # skip title if previously encountered
        continue
    score = analyzer.polarity_scores(title)
    results[title] = score
    results[title]['headline'] = title
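To actually see anything in the console, you then need to print what you collected. A small illustrative snippet, reusing the results dict built above and matching the output format shown in the question:

from pprint import pprint

# one score dict per headline
pprint(list(results.values()), width=100)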
You could also make the query and the loop more efficient by searching by date, or simply by tracking the timestamp of the last post you saw and breaking out of the loop early for anything older, in the same spirit as the set() you used at the start. Note that the scores are dicts (which are not hashable), so collect them in a list rather than a set:
results = []
...
    if post.created <= last_date:
        # stop once we reach posts already processed on a previous pass
        break
    last_date = post.created
    score = analyzer.polarity_scores(post.title)
    score['headline'] = post.title
    results.append(score)
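Since the question is specifically about real-time/streaming analysis, another approach worth sketching (not part of the original answer) is PRAW's built-in submission stream, which blocks and yields new posts as they arrive instead of polling with time.sleep(). A minimal sketch, assuming the same placeholder credentials as above; skip_existing=True ignores the existing backlog and only reports posts created after startup:

import praw
from nltk.sentiment.vader import SentimentIntensityAnalyzer

reddit = praw.Reddit(client_id='xxxx',
                     client_secret='MLK5gKaEM2FxxxxxxxxI',
                     user_agent='testing_api')
analyzer = SentimentIntensityAnalyzer()

# yields each new r/worldnews submission as soon as it appears
for submission in reddit.subreddit('worldnews').stream.submissions(skip_existing=True):
    score = analyzer.polarity_scores(submission.title)
    score['headline'] = submission.title
    print(score)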
You may find this tutorial helpful for more detail on building a system like this: https://www.codeproject.com/Articles/5269358/Introducing-NLTK-for-Natural-Language-Processing-w