I was wondering whether it is possible to aggregate JSON data into new values in Python.
For example, a single JSON value looks like this:
{"time": {"Friday": {"20:00": 2, "19:00": 1, "22:00": 10, "21:00": 5,
"23:00": 14, "0:00": 2, "18:00": 2}, "Thursday": {"23:00": 1,
"0:00": 1, "19:00": 1, "18:00": 1, "16:00": 2, "22:00": 2},
"Wednesday": {"17:00": 2, "23:00": 3, "16:00": 1, "22:00": 1,
"19:00": 1, "21:00": 1}, "Sunday": {"16:00": 2, "17:00": 2, "19:00": 1,
"22:00": 4, "21:00": 4, "0:00": 3, "1:00": 2}, "Saturday":
{"21:00": 4, "20:00": 3, "23:00": 10, "22:00": 7, "18:00":
1, "15:00": 2, "16:00": 1, "17:00": 1, "0:00": 8, "1:00":
1}, "Tuesday": {"19:00": 1, "17:00": 1, "1:00": 2, "21:00":
1, "23:00": 3}, "Monday": {"18:00": 2, "23:00": 1, "22:00": 2}}}
I want to aggregate it such that it is in four categories based on the times it is open.
The four categories are:
6 am - 12 noon: morning
12 noon - 5 pm: afternoon
5 pm - 11 pm: evening
11 pm - 6 am: night
For example:
If this is the current value:
"Friday": {"20:00": 5, "21:00": 10}
Then the output should be:
"Friday": {"morning": 0, "afternoon": 0, "evening": 15, "night": 0}
Thus the output should be in the form:
"Day": {"morning": count, "afternoon": count, "evening": count, "night": count}
This should be done for each of the hundreds of JSON values.
My thinking was that I could create 4 bins, one for each of the time ranges. I would then use two for loops to go through each day's values. If a time falls in the range of a bin, I would add its count to that bin. I would then store the day in a dictionary whose values are dictionaries as well. The inner dictionary would consist of the four time ranges with the count as the value. I would then return this for the day and start over for the next day.
Here's what I have so far. I still need to implement the aggregate function.
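The plan described above could be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the `BUCKETS` table and the choice to take an already parsed dictionary (rather than a string) are assumptions, and the ranges are treated as half-open, so "23:00" counts as night and "17:00" as evening.

```python
# Hour ranges are half-open [start, end). "night" wraps past midnight,
# so it is used as the fallback when no other bucket matches.
BUCKETS = [
    ("morning", 6, 12),
    ("afternoon", 12, 17),
    ("evening", 17, 23),
]


def aggregate(time_dict):
    """Map {day: {"H:MM": count}} to {day: {bucket: total}}."""
    result = {}
    for day, slots in time_dict.items():
        totals = {"morning": 0, "afternoon": 0, "evening": 0, "night": 0}
        for time_slot, count in slots.items():
            hour = int(time_slot.split(":")[0])  # "0:00" -> 0, "20:00" -> 20
            bucket = "night"
            for name, start, end in BUCKETS:
                if start <= hour < end:
                    bucket = name
                    break
            totals[bucket] += count
        result[day] = totals
    return result


sample = {
    "Friday": {"20:00": 5, "21:00": 10},
    "Monday": {"23:00": 1, "0:00": 2, "16:00": 3},
}
summary = aggregate(sample)
print(summary)
```

Every day gets all four keys, so buckets with no check-ins show up as 0, as in the desired output.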
import json
from datetime import datetime

def cleanStr4SQL(s):
    return s.replace("'","`").replace("\n"," ")

def parseCheckinData():
    #write code to parse yelp_checkin.JSON
    with open('yelp_checkin.JSON') as f:
        outfile = open('checkin.txt', 'w')
        line = f.readline()
        count_line = 0
        while line:
            data = json.loads(line)
            outfile.write(cleanStr4SQL(str(data['business_id'])) + '\t')
            outfile.write(aggregate(cleanStr4SQL(str(data['time']))))
            line = f.readline()
            count_line+=1
    print(count_line)
    outfile.close()
    f.close()

def aggregate(line):
    morning = []
    afternoon = []
    evening = []
    night = []
    for l in line:
        print(l)
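One thing to watch in the loop above: `str(data['time'])` turns the parsed dictionary back into a string, so a `for` loop inside `aggregate` would walk over single characters instead of days. A rough sketch of passing the dictionary itself is shown below. The names `write_checkins` and `total_per_day` and the "abc" business id are made up for illustration, the `cleanStr4SQL` step is left out for brevity, and `total_per_day` is only a stand-in for whatever aggregation function is eventually written.

```python
import io
import json


def write_checkins(infile, outfile, aggregate):
    """Read one JSON object per line and write 'business_id<TAB>summary'."""
    count = 0
    for line in infile:
        if not line.strip():
            continue  # skip blank lines
        data = json.loads(line)
        # Pass the parsed dict, not str(data["time"]).
        summary = aggregate(data["time"])
        outfile.write(data["business_id"] + "\t" + json.dumps(summary) + "\n")
        count += 1
    return count


def total_per_day(time_dict):
    # Placeholder aggregation: total check-ins per day.
    return {day: sum(slots.values()) for day, slots in time_dict.items()}


# In-memory files stand in for yelp_checkin.JSON and checkin.txt.
source = io.StringIO(
    '{"business_id": "abc", "time": {"Friday": {"20:00": 5, "21:00": 10}}}\n'
)
target = io.StringIO()
written = write_checkins(source, target, total_per_day)
print(written, target.getvalue())
```

Iterating over the file object directly replaces the manual `readline()` calls, and `json.dumps` writes the summary back out as JSON text.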
I was wondering what the best approach to solving this in Python would be.
Any advice is appreciated. I know there is no code, but if someone could point me in a direction that would be great.
Thank you for reading
Answer 0 (score: 1)
Here is one possible approach. I only tried it with a single JSON string, so you may need to extend it to handle multiple records.
import json
import pandas as pd
jsontxt = '{"time": {"Friday": {"20:00": 2, "19:00": 1, "22:00": 10, "21:00": 5, "23:00": 14, "0:00": 2, "18:00": 2}, "Thursday": {"23:00": 1, "0:00": 1, "19:00": 1, "18:00": 1, "16:00": 2, "22:00": 2}, "Wednesday": {"17:00": 2, "23:00": 3, "16:00": 1, "22:00": 1, "19:00": 1, "21:00": 1}, "Sunday": {"16:00": 2, "17:00": 2, "19:00": 1, "22:00": 4, "21:00": 4, "0:00": 3, "1:00": 2}, "Saturday": {"21:00": 4, "20:00": 3, "23:00": 10, "22:00": 7, "18:00": 1, "15:00": 2, "16:00": 1, "17:00": 1, "0:00": 8, "1:00": 1}, "Tuesday": {"19:00": 1, "17:00": 1, "1:00": 2, "21:00": 1, "23:00": 3}, "Monday": {"18:00": 2, "23:00": 1, "22:00": 2}}}'
# Parse the json and convert to a dictionary object
jsondict = json.loads(jsontxt)
# Convert the "time" element in the dictionary to a pandas DataFrame
df = pd.DataFrame(jsondict['time'])
# Define a function to convert the time slots to the categories
def cat(time_slot):
    if '06:00' <= time_slot < '12:00':
        return 'Morning'
    elif '12:00' <= time_slot < '17:00':
        return 'Afternoon'
    elif '17:00' <= time_slot < '23:00':
        return 'Evening'
    else:
        return 'Night'
# Add a new column "Time" to the DataFrame and set the values after left padding the values in the index
df['Time'] = df.index.str.rjust(5,'0')
# Add a new column "Category" and set its values based on the time slot
df['Category'] = df['Time'].apply(cat)
# Create a pivot table based on the "Category" column, dropping the helper "Time" column first so only the counts are summed
pt = df.drop(columns='Time').pivot_table(index='Category', aggfunc='sum', fill_value=0)
# Convert the pivot table to a dictionary to get the json output you want
jsonoutput = pt.to_dict()
print(jsonoutput)
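Note that the pivot table only contains categories that actually occur in the data (the sample has no morning slots), and depending on the pandas version the totals can come out as floats. A possible variation, sketched here under the assumption that lowercase category names and integer counts are wanted, groups on the index directly and reindexes to all four categories. The names `category_for` and `aggregate_with_pandas` are made up for this sketch.

```python
import pandas as pd

CATEGORIES = ["morning", "afternoon", "evening", "night"]


def category_for(time_slot):
    """Map a slot such as "20:00" or "0:00" to its category name."""
    hour = int(time_slot.split(":")[0])
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 23:
        return "evening"
    return "night"


def aggregate_with_pandas(time_dict):
    # Rows are time slots, columns are days; missing slots become 0.
    df = pd.DataFrame(time_dict).fillna(0)
    # A function passed to groupby is called on each index value.
    totals = df.groupby(category_for).sum()
    # Make sure all four categories are present, then cast to integers.
    totals = totals.reindex(CATEGORIES, fill_value=0).astype(int)
    return totals.to_dict()


result = aggregate_with_pandas(
    {"Friday": {"20:00": 5, "21:00": 10}, "Monday": {"23:00": 1, "16:00": 3}}
)
print(result)
```

Parsing the hour as an integer also avoids the need to left-pad the time strings before comparing them.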
Hope that helps.