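I'm using pandas to run some calculations on a list of news articles: averaging the NLP data grouped by date and grouped by source, so the results can be fed into a JS chart. With 20k records the operation takes 2 to 3 seconds. If possible, I'd like to bring that down to under 0.5 seconds. The code is: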
import statistics
from pandas import DataFrame

articles = [{'title': "article title", 'rounded_polarity': 63, 'rounded_subjectivity': 45, 'source_name': 'foxnews', 'day': '2020-01-11 00:00:00+00:00'}, ...]

def get_averages(articles):
    data_frame = DataFrame(articles)
    # Average the NLP scores per day and per source
    grouped_by_day = data_frame.groupby(['day']).mean()
    grouped_by_source = data_frame.groupby(['source_name']).mean()
    grouped_by_day_dict = grouped_by_day.to_dict()
    grouped_by_source_dict = grouped_by_source.to_dict()
    # Sources with the highest and lowest average for each score
    max_sentiments = grouped_by_source.idxmax().to_dict()
    min_sentiments = grouped_by_source.idxmin().to_dict()
    # Overall averages, taken over the per-source averages
    total_avg_subjectivity = statistics.mean([v for k, v in grouped_by_source_dict['rounded_subjectivity'].items()])
    total_avg_sentiment = statistics.mean([v for k, v in grouped_by_source_dict['rounded_polarity'].items()])
    return {
        'most_positive_source': max_sentiments['rounded_polarity'],
        'least_positive_source': min_sentiments['rounded_polarity'],
        'most_subjective_source': max_sentiments['rounded_subjectivity'],
        'least_subjective_source': min_sentiments['rounded_subjectivity'],
        'average_sentiment': total_avg_sentiment,
        'average_subjectivity': total_avg_subjectivity,
        'averages_by_day': grouped_by_day_dict,
        'earliest_publish_date': grouped_by_day.index.min(),
        'latest_publish_date': grouped_by_day.index.max()
    }
How can I use more of pandas' built-in functionality to speed this up?
Answer 0: (score: 1)
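Well, I think the pandas and numpy way of doing this is very similar to what you already have, just using built-in functions and methods: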
import pandas as pd
import numpy as np

articles = [{'title': "article title", 'rounded_polarity': 63, 'rounded_subjectivity': 45, 'source_name': 'foxnews', 'day': '2020-01-11 00:00:00+00:00'}]
df = pd.DataFrame(articles)

# numeric_only=True skips string columns such as 'title', which recent pandas versions refuse to average
grouped_by_day = df.groupby('day').mean(numeric_only=True)
grouped_by_source = df.groupby('source_name').mean(numeric_only=True)
max_sentiments = grouped_by_source.idxmax()
min_sentiments = grouped_by_source.idxmin()

# axis=0 averages each column separately; zip the values back to their column names so they can be looked up by name.
# grouped_by_source.mean() gives the same values if you don't want to add a numpy dependency, though numpy may be faster.
total_avg = dict(zip(grouped_by_source.columns, np.mean(grouped_by_source.to_numpy(), axis=0)))

result = {'most_positive_source': max_sentiments['rounded_polarity'],
          'least_positive_source': min_sentiments['rounded_polarity'],
          'most_subjective_source': max_sentiments['rounded_subjectivity'],
          'least_subjective_source': min_sentiments['rounded_subjectivity'],
          'average_sentiment': total_avg['rounded_polarity'],
          'average_subjectivity': total_avg['rounded_subjectivity'],
          'averages_by_day': grouped_by_day.to_dict(),
          'earliest_publish_date': grouped_by_day.index.min(),
          'latest_publish_date': grouped_by_day.index.max()}
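
If it helps, here is the same idea wrapped back into a function, as a rough sketch rather than a benchmarked solution. It assumes the column names from the question, builds the frame from only the columns that are actually used (so the `title` strings are never copied in), and selects the numeric columns before calling `mean()`. The `NUMERIC_COLS` constant is just a name I made up for illustration, and I haven't measured whether this gets 20k records under 0.5 seconds, so time it on your own data.

```python
import pandas as pd

# Hypothetical constant: the two score columns named in the question
NUMERIC_COLS = ['rounded_polarity', 'rounded_subjectivity']


def get_averages(articles):
    """Summarise average sentiment scores by day and by source."""
    # Keep only the columns the calculation needs; other keys are ignored
    df = pd.DataFrame(articles, columns=['day', 'source_name'] + NUMERIC_COLS)

    # Select the numeric columns first so mean() never touches strings
    by_day = df.groupby('day')[NUMERIC_COLS].mean()
    by_source = df.groupby('source_name')[NUMERIC_COLS].mean()

    # Mean of the per-source means, matching what the question computes
    overall = by_source.mean()
    highest = by_source.idxmax()
    lowest = by_source.idxmin()

    return {
        'most_positive_source': highest['rounded_polarity'],
        'least_positive_source': lowest['rounded_polarity'],
        'most_subjective_source': highest['rounded_subjectivity'],
        'least_subjective_source': lowest['rounded_subjectivity'],
        'average_sentiment': float(overall['rounded_polarity']),
        'average_subjectivity': float(overall['rounded_subjectivity']),
        'averages_by_day': by_day.to_dict(),
        'earliest_publish_date': by_day.index.min(),
        'latest_publish_date': by_day.index.max(),
    }
```

Note that `day` stays a plain string here, so the earliest and latest dates are found by string comparison. That works for the ISO-style timestamps shown in the question but would not for other date formats.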