How to read from a .csv file and convert to .json in Python (with a different data structure)?

Time: 2017-04-13 11:51:04

Tags: json mongodb python-2.7 csv

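I'm trying to write a Python script that reads a .csv file and reshapes its values into a specific format/data structure in .json, which I can then import into mongoDB. I'm using pedestrian data as my dataset, and there are over a million entries with redundant data. I'm stuck on writing the actual script and converting the data into the .json format I want.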

data.csv (raw):

Id,Date_Time,Year,Month,Mdate,Day,Time,Sensor_ID,Sensor_Name,Hourly_Counts
1, 01-JUN-2009 00:00,2009,June,1,Monday,0,4,Town Hall (West),194
2, 01-JUN-2009 00:00,2009,June,1,Monday,0,17,Collins Place (South),21
3, 01-JUN-2009 00:00,2009,June,1,Monday,0,18,Collins Place (North),9
4, 01-JUN-2009 00:00,2009,June,1,Monday,0,16,Australia on Collins,39
5, 01-JUN-2009 00:00,2009,June,1,Monday,0,2,Bourke Street Mall (South),28
6, 01-JUN-2009 00:00,2009,June,1,Monday,0,1,Bourke Street Mall (North),37
7, 01-JUN-2009 00:00,2009,June,1,Monday,0,13,Flagstaff Station,1
8, 01-JUN-2009 00:00,2009,June,1,Monday,0,3,Melbourne Central,155
9, 01-JUN-2009 00:00,2009,June,1,Monday,0,15,State Library,98
10, 01-JUN-2009 00:00,2009,June,1,Monday,0,9,Southern Cross Station,7
11, 01-JUN-2009 00:00,2009,June,1,Monday,0,10,Victoria Point,8
12, 01-JUN-2009 00:00,2009,June,1,Monday,0,12,New Quay,30

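Since I'm uploading to mongoDB, the Id column is redundant in my context, so I need my script to skip it. Sensor_ID is not unique across rows, but I plan to use it as the PK and build a list of objects that distinguishes the Hourly_Counts readings.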

My goal is to produce JSON objects like this from the data:

**data.json**

    [
        {
            "Sensor_ID": 4,
            "Sensor_Name": "Town Hall (West)",
            "countList": [
                {
                    "Date_Time": "01-JUN-2009 00:00",
                    "Year": 2009,
                    "Month": "June",
                    "Mdate": 1,
                    "Day": "Monday",
                    "Time": 0,
                    "Hourly_Counts": 194
                },
                {
                    "Date_Time": "01-JUN-2009 00:00",
                    "Year": 2009,
                    "Month": "June",
                    "Mdate": 1,
                    "Day": "Monday",
                    "Time": 1,
                    "Hourly_Counts": 82
                }
            ]
        },
        {
            "Sensor_ID": 17,
            "Sensor_Name": "Collins Place (South)",
            "countList": [
                {
                    "Date_Time": "01-JUN-2009 00:00",
                    "Year": 2009,
                    "Month": "June",
                    "Mdate": 1,
                    "Day": "Monday",
                    "Time": 0,
                    "Hourly_Counts": 21
                }
            ]
        }
    ]

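And so on. What I'm trying to do is this: whenever the script reads a Sensor_ID, it builds a JSON object from the listed fields and appends it to that sensor's countList. In the example above I added a second entry to the countList of the sensor with Sensor_ID = 4 to show this.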
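The mapping from one CSV row to one countList entry can be sketched as follows. This is only an illustration under stated assumptions: the helper name `row_to_entry` is hypothetical, the code is written for Python 3, and casting Year, Mdate, Time and Hourly_Counts to integers (and stripping the leading space in Date_Time) is inferred from the target JSON above rather than stated in the post.

```python
import csv
import io

# Columns that go into each countList entry (Id is deliberately left out).
ENTRY_FIELDS = ('Date_Time', 'Year', 'Month', 'Mdate', 'Day', 'Time', 'Hourly_Counts')
# Columns shown as numbers in the target JSON (assumption: always integers).
INT_FIELDS = ('Year', 'Mdate', 'Time', 'Hourly_Counts')


def row_to_entry(row):
    """Build one countList entry from a csv.DictReader row (hypothetical helper)."""
    entry = {}
    for key in ENTRY_FIELDS:
        value = row[key].strip()  # the sample data has a leading space in Date_Time
        entry[key] = int(value) if key in INT_FIELDS else value
    return entry


# Header plus the first data row of data.csv, inlined so the sketch is self-contained.
sample = u"""Id,Date_Time,Year,Month,Mdate,Day,Time,Sensor_ID,Sensor_Name,Hourly_Counts
1, 01-JUN-2009 00:00,2009,June,1,Monday,0,4,Town Hall (West),194
"""

reader = csv.DictReader(io.StringIO(sample))  # field names come from the header row
first_entry = row_to_entry(next(reader))
print(first_entry)
```

Without the integer casts, every value read by the csv module stays a string, so the JSON would contain "2009" rather than 2009.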

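I'm using Python 2.7.x, and I've looked at every question about this on stackoverflow and every other site I could find. Few people seem to want to restructure the .csv data when converting to .json, so it has been a bit difficult.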

I'm still relatively new to Python, so I thought this would be a good exercise.

csvtojson.py

import csv
import json

def csvtojson():

    filename = 'data.csv'
    fieldnames = ('Id','Date_Time','Year','Month','Mdate','Day',
    'Time','Sensor_ID','Sensor_Name', 'Hourly_Counts')

    dataTime = ('Date_Time','Year','Month','Mdate','Day',
    'Time', 'Hourly_Counts')

    all_data = {}

    with open(filename, 'rb') as csvfile:
        reader = csv.DictReader(csvfile, fieldnames)
        #skip header
        next(reader)
        current_sensorID = None
        for row in reader:
            sensor_ID = row['Sensor_ID']
            sensorName = row['Sensor_Name']
            data = all_data[sensor_ID] = {}
            data['dataTime'] = dict((k, row[k]) for k in dataTime)


        print json.dumps(all_data, indent=4, sort_keys=True)    

if __name__ == "__main__":

    csvtojson()

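What I have so far treats countList as its own object, but it does not create a list of objects, which may break the import into mongoDB. It filters by Sensor_ID, but when there are duplicates it overwrites the previous entry instead of appending to countList. I can't seem to get the data into the format/data structure I want, and I'm not even sure it is the right structure; the end goal is to import millions of tuples into mongoDB in the form I listed. For now I'm testing with a small set.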
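The overwriting described above comes from `all_data[sensor_ID] = {}` running on every row, which replaces whatever was stored for that sensor. One possible way around it is sketched below: create the per-sensor document only the first time a Sensor_ID is seen, then append to its countList on every row. This is a sketch under assumptions, not a definitive solution: it is written for Python 3, the function name `group_by_sensor` is hypothetical, the integer casts are inferred from the target JSON, and the third sample row (Id 13) is made up to mirror the second countList entry in the target JSON.

```python
import csv
import io
import json
from collections import OrderedDict

ENTRY_FIELDS = ('Date_Time', 'Year', 'Month', 'Mdate', 'Day', 'Time', 'Hourly_Counts')
INT_FIELDS = ('Year', 'Mdate', 'Time', 'Hourly_Counts')  # assumption: always integers


def group_by_sensor(csvfile):
    """Return a list of {Sensor_ID, Sensor_Name, countList} documents."""
    sensors = OrderedDict()  # Sensor_ID -> document, in order of first appearance
    for row in csv.DictReader(csvfile):  # the header row supplies the field names
        sensor_id = int(row['Sensor_ID'])
        if sensor_id not in sensors:
            # First time this sensor is seen: create its document once.
            sensors[sensor_id] = OrderedDict([
                ('Sensor_ID', sensor_id),
                ('Sensor_Name', row['Sensor_Name'].strip()),
                ('countList', []),
            ])
        # Build the entry without Id, Sensor_ID and Sensor_Name.
        entry = OrderedDict()
        for key in ENTRY_FIELDS:
            value = row[key].strip()
            entry[key] = int(value) if key in INT_FIELDS else value
        # Append instead of overwriting, so repeated sensors accumulate readings.
        sensors[sensor_id]['countList'].append(entry)
    return list(sensors.values())


# Rows 1 and 2 are from data.csv; row 13 is invented for illustration.
sample = u"""Id,Date_Time,Year,Month,Mdate,Day,Time,Sensor_ID,Sensor_Name,Hourly_Counts
1, 01-JUN-2009 00:00,2009,June,1,Monday,0,4,Town Hall (West),194
2, 01-JUN-2009 00:00,2009,June,1,Monday,0,17,Collins Place (South),21
13, 01-JUN-2009 00:00,2009,June,1,Monday,1,4,Town Hall (West),82
"""

documents = group_by_sensor(io.StringIO(sample))
print(json.dumps(documents, indent=4))
```

For the real file, the `io.StringIO(sample)` argument would be replaced by the file object from `open('data.csv', newline='')`. If the result is saved as a single JSON array, mongoimport needs its `--jsonArray` option to read it. Because a MongoDB document is limited to 16 MB, a countList that holds every reading for a sensor may need to be split up for the full dataset.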

2 answers:

Answer 0: (score: 0)

Please check the following.

https://github.com/gurdyals/test-repo/tree/master/MongoDB

Use the "MongoDB_py.zip" file.

I also convert csv data into a MongoDB dict there.

Please let me know if you have any questions.

Thanks

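Answer 1: (score: 0)

Here is sample code that does something similar to the above using Python pandas. You could also do some aggregation in the DataFrame if you want to summarize the data to get rid of the redundancy.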

import json

import pandas as pd

df = pd.read_csv('data.csv')

# For each (Sensor_ID, Sensor_Name) pair, drop the two key columns and
# serialize the remaining columns of the group's rows with to_json().
# Note: to_json() returns a str, so each countList value ends up as a
# JSON-encoded string rather than a nested list, and Id is not dropped.
grouped = df.groupby(['Sensor_ID', 'Sensor_Name']).apply(
    lambda x: x.drop(['Sensor_ID', 'Sensor_Name'], axis=1).to_json(orient='records'))
grouped.name = 'countList'

# Turn the grouped Series back into one record per sensor and pretty-print it.
js = json.loads(pd.DataFrame(grouped).reset_index().to_json(orient='records'))
print json.dumps(js, indent=4)

Output:

[
    {
        "Sensor_ID": 1, 
        "countList": "[{\"Id\":6,\"Date_Time\":\" 01-JUN-2009 00:00\",\"Year\":2009,\"Month\":\"June\",\"Mdate\":1,\"Day\":\"Monday\",\"Time\":0,\"Hourly_Counts\":37}]", 
        "Sensor_Name": "Bourke Street Mall (North)"
    }, 
    {
        "Sensor_ID": 2, 
        "countList": "[{\"Id\":5,\"Date_Time\":\" 01-JUN-2009 00:00\",\"Year\":2009,\"Month\":\"June\",\"Mdate\":1,\"Day\":\"Monday\",\"Time\":0,\"Hourly_Counts\":28}]", 
        "Sensor_Name": "Bourke Street Mall (South)"
    }, 
    {
        "Sensor_ID": 3, 
        "countList": "[{\"Id\":8,\"Date_Time\":\" 01-JUN-2009 00:00\",\"Year\":2009,\"Month\":\"June\",\"Mdate\":1,\"Day\":\"Monday\",\"Time\":0,\"Hourly_Counts\":155}]", 
        "Sensor_Name": "Melbourne Central"
    }, 
    {
        "Sensor_ID": 4, 
        "countList": "[{\"Id\":1,\"Date_Time\":\" 01-JUN-2009 00:00\",\"Year\":2009,\"Month\":\"June\",\"Mdate\":1,\"Day\":\"Monday\",\"Time\":0,\"Hourly_Counts\":194}]", 
        "Sensor_Name": "Town Hall (West)"
    },