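I have a set of news articles in JSON format, and I am having trouble parsing the dates. The issue is that when the articles were converted to JSON, the date came through correctly, but the edition information ended up in the same field. Here is an example: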
{"date": "December 31, 1995, Sunday, Late Edition - Final", "body": "AFTER a year of dizzying new heights for the market, investors may despair of finding any good stocks left. Navistar plans to slash costs by $112 million in 1996. Advanced Micro Devices has made a key acquisition. For the bottom-fishing investor, therefore, the big nail-biter is: Will the changes be enough to turn a company around? ", "title": "INVESTING IT;"}
{"date": "December 31, 1995, Sunday, Late Edition - Final", "body": "Few issues stir as much passion in so many communities as the simple act of moving from place to place: from home to work to the mall and home again. It was an extremely busy and productive year for us, said Frank J. Wilson, the State Commissioner of Transportation. There's a sense of urgency to get things done. ", "title": "ROAD AND RAIL;"}
{"date": "December 31, 1996, Sunday, Late Edition - Final", "body": "Widespread confidence in the state's economy prevailed last January as many businesses celebrated their most robust gains since the recession. And Steven Wynn, the chairman of Mirage Resorts, who left Atlantic City eight years ago because of local and state regulations, is returning to build a $1 billion two-casino complex. ", "title": "NEW JERSEY & CO.;"}
Since my goal is to count the number of articles that contain certain words, I loop over the articles as follows:
import json
import re
import pandas
for i in range(1995, 2017):
    df = pandas.DataFrame([json.loads(l) for l in open('USAT_%d.json' % i)])
    # Parse dates and set index
    df.date = pandas.to_datetime(df.date)  # is giving me a problem
    df.set_index('date', inplace=True)
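I am looking for direction on the most efficient way to solve this. I was thinking of something along the lines of "ignore everything after the day of the week" when parsing the date. Is there such a thing?
Thanks in advance.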
Answer 0 (score: 2)
You can split the date column with str.split, concatenate the first and second resulting columns (month and day, then year, e.g. December 31 and 1995), and finally call to_datetime:
import json
import pandas

for i in range(1995, 2017):
    df = pandas.DataFrame([json.loads(l) for l in open('USAT_%d.json' % i)])
    # Split "December 31, 1995, Sunday, Late Edition - Final" on ', '
    a = df.date.str.split(', ', expand=True)
    # Column 0 is "December 31", column 1 is "1995"; the rest is dropped
    df.date = a.iloc[:, 0] + ' ' + a.iloc[:, 1]
    # Parse dates and set index
    df.date = pandas.to_datetime(df.date)
    df.set_index('date', inplace=True)
    print(df)
                                                         body  \
date
1995-12-31  AFTER a year of dizzying new heights for the m...
1995-12-31  Few issues stir as much passion in so many com...
1996-12-31  Widespread confidence in the state's economy p...

                        title
date
1995-12-31      INVESTING IT;
1995-12-31     ROAD AND RAIL;
1996-12-31  NEW JERSEY & CO.;
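As a possible alternative, closer to the "ignore everything after the date" idea in the question, the leading "Month day, year" part can be pulled out with a regular expression and parsed with an explicit format. This is only a sketch, not part of the original answer: the Series name raw, the regex pattern, and the last two sample values are assumptions made for illustration (the first two values are copied from the question).

```python
import pandas as pd

# First two values come from the question; the last two are hypothetical,
# added to show a date without an edition and a value with no date at all.
raw = pd.Series([
    "December 31, 1995, Sunday, Late Edition - Final",
    "December 31, 1996, Sunday, Late Edition - Final",
    "January 5, 2001, Friday",
    "no date here",
])

# Capture "<Month> <day>, <year>" at the start of the string and ignore
# whatever follows (weekday, edition, ...). Rows that do not match give NaN.
date_part = raw.str.extract(r"^(\w+ \d{1,2}, \d{4})", expand=False)

# An explicit format avoids format guessing; errors="coerce" turns the
# unmatched rows into NaT instead of raising an exception.
parsed = pd.to_datetime(date_part, format="%B %d, %Y", errors="coerce")
print(parsed)
```

Inside the loop from the answer, the same two calls could be applied to df.date before set_index. Whether this is faster than str.split on the real files has not been measured here.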