Using Python

Date: 2016-04-15 04:20:20

Tags: python mongodb csv

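I have a CSV file in which one field/column contains commas (","). I load this CSV into MongoDB for data manipulation. I want to remove everything from the comma to the right, keeping only the text to the left of the comma.

What is the most efficient way to accomplish this? In my CSV-to-MongoDB import script (I use pandas)? Or afterwards, once the data is already in MongoDB? Honestly, I am new to programming and would like to know how to do it in either scenario, but I would most like to see the most efficient solution.

Here is my CSV-to-MongoDB import script in Python: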

#!/usr/bin/env python
import json
import os

import pandas as pd
import pymongo


def import_content(filepath):
    mng_client = pymongo.MongoClient('localhost', 27017)
    mng_db = mng_client['swx_inv']
    collection_name = 'device.switch'
    db_cm = mng_db[collection_name]
    cdir = os.path.dirname(__file__)
    file_res = os.path.join(cdir, filepath)

    # Skip the 2 leading lines and the footer line of the CSV export.
    # skipfooter (formerly skip_footer) requires the Python parser engine.
    data = pd.read_csv(file_res, skiprows=2, skipfooter=1, engine='python')
    # Round-trip through JSON so NaN values become None (null in MongoDB).
    data_json = json.loads(data.to_json(orient='records'))
    # Replace the collection's contents with the freshly loaded records.
    db_cm.delete_many({})
    db_cm.insert_many(data_json)


if __name__ == "__main__":
    filepath = '/vagrant/data/DeviceInventory-Category.Switch.csv'
    import_content(filepath)

Here are the first three lines of the CSV for reference. I am trying to change the last field, "OS Image":

Device,Serial Number,Realm,Vendor,Model,OS Image
ABBNWX0100,SMG3453ESDN,BlAH BLAH,Cisco,WS-C6509-E,"IOS 12.2(33)SXI9, s72033_rp-ADVENTERPRISEK9_WAN-M"
ABBNWX0101,SDG127343S0,BLAH BLAH,Cisco,WS-C4506-E,"IOS 12.2(53)SG8, cat4500-IPBASEK9-M"
ABBNWX0102,TREFDSFY1KK,BLAH BLAH,Cisco,WS-C3560V2-48PS-S,"IOS 12.2(55)SE5, C3560-IPBASEK9-M"
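Edit: I found a way to do what I need with pandas before uploading to the MongoDB collection. I had to do it twice because the column data uses two different delimiters and I could not get a regular expression to work: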

# Use pandas to read CSV, skipping top 2 lines & footer line from
# CSV export. Set column data to string type.
data = pd.read_csv(
    file_res, index_col=False, skiprows=2,
    skipfooter=1, engine='python',
    dtype={'Device': str, 'Serial Number': str,
           'Realm': str, 'Vendor': str, 'Model': str,
           'OS Image': str}
)
# Drop rows where Serial Number is empty
data = data.dropna(subset=['Serial Number'])

# Keep only the text before the first "," and then before the first ";"
# in the OS Image column. The .str accessor leaves missing values as NaN.
data['OS Image'] = data['OS Image'].str.split(',').str[0]
data['OS Image'] = data['OS Image'].str.split(';').str[0]

1 answer:

Answer 0 (score: 1)

import csv

s='''Device,Serial Number,Realm,Vendor,Model,OS Image
ABBNWX0100,SMG3453ESDN,BlAH BLAH,Cisco,WS-C6509-E,"IOS 12.2(33)SXI9, s72033_rp-ADVENTERPRISEK9_WAN-M"
ABBNWX0101,SDG127343S0,BLAH BLAH,Cisco,WS-C4506-E,"IOS 12.2(53)SG8, cat4500-IPBASEK9-M"
ABBNWX0102,TREFDSFY1KK,BLAH BLAH,Cisco,WS-C3560V2-48PS-S,"IOS 12.2(55)SE5, C3560-IPBASEK9-M"'''

print("\n".join([','.join(row[:5])+","+str(row[5].split(",")[0]) for row in csv.reader(s.split("\n"))]))

The list comprehension converted to a loop for readability:

newtext=""
for row in csv.reader(s.split("\n")):
    newtext+=','.join(row[:5])+","+str(row[5].split(",")[0])+"\n"
print(newtext)

Output:

Device,Serial Number,Realm,Vendor,Model,OS Image
ABBNWX0100,SMG3453ESDN,BlAH BLAH,Cisco,WS-C6509-E,IOS 12.2(33)SXI9
ABBNWX0101,SDG127343S0,BLAH BLAH,Cisco,WS-C4506-E,IOS 12.2(53)SG8
ABBNWX0102,TREFDSFY1KK,BLAH BLAH,Cisco,WS-C3560V2-48PS-S,IOS 12.2(55)SE5

https://ideone.com/FMJCrO

For a file, you would use

with open(fname) as f:
    content = f.readlines()

content will then hold a list of the lines in the file, which you can pass to csv.reader(content).
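
Putting the pieces above together, here is a minimal sketch of the same idea that looks the column up by name instead of assuming it is the sixth field, and writes the result back out with csv.writer so that any field still needing quotes gets them. The helper names keep_left_of_comma and rows_to_csv are made up for this illustration and are not part of the answer above; the sample line is taken from the question.

```python
import csv
import io


def keep_left_of_comma(lines, column="OS Image"):
    """Return all rows, with `column` cut at its first comma.

    `lines` is any iterable of CSV text lines (an open file works too).
    """
    reader = csv.reader(lines)
    header = next(reader)
    idx = header.index(column)  # locate the column by name, not position
    rows = [header]
    for row in reader:
        if not row:  # skip blank lines
            continue
        row[idx] = row[idx].split(",")[0].strip()
        rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialize rows back to CSV text; csv.writer adds quotes when needed."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


sample = [
    'Device,Serial Number,Realm,Vendor,Model,OS Image',
    'ABBNWX0101,SDG127343S0,BLAH BLAH,Cisco,WS-C4506-E,'
    '"IOS 12.2(53)SG8, cat4500-IPBASEK9-M"',
]
cleaned = keep_left_of_comma(sample)
print(rows_to_csv(cleaned), end="")
```

To run it against a real file, something like `with open(fname, newline="") as f: cleaned = keep_left_of_comma(f)` should work, since csv.reader accepts an open file directly. Note that the question's export has two extra leading lines and a footer line, which this sketch does not skip.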