Reading an unstructured csv into a Python Pandas dataframe

Time: 2017-03-08 01:55:50

Tags: python csv pandas

I have an unstructured csv file that I want to read into a Pandas dataframe.

Here is a sample of the csv:

customer_id,123,acct1,1000,10,acct2,2000,20,acct3,3000,30
customer_id,456,acct1,4000,40,acct2,5000,50
customer_id,789,acct3,6000,60
customer_id,888,acct1,7000,,acct2,,70
customer_id,999

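Each row represents a customer, the accounts associated with that customer, and the variables associated with each of those accounts. Customers do not all hold the same accounts, so there is no guarantee that any two rows contain the same accounts.

If an account is present, the account name is followed by a predetermined number of variables (in this case, two per account). However, even when an account is present, some of its variables may be missing (see, for example, customer_id 888).

If a customer does not have an account, that account does not appear in the customer's record.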
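To make the record layout concrete, here is a minimal sketch of a parser for a single line. It is only an illustration: the function name `parse_record` and its `vars_per_account` parameter are hypothetical, and it assumes every account contributes exactly one name field followed by two variable fields.

```python
def parse_record(line, vars_per_account=2):
    """Split one csv line into (customer_id, {account_name: [variables]})."""
    fields = line.strip().split(',')
    customer_id = fields[1]  # fields[0] is the literal 'customer_id' label
    accounts = {}
    step = vars_per_account + 1  # the account name plus its variables
    for i in range(2, len(fields), step):
        name = fields[i]
        values = fields[i + 1:i + step]
        # Pad a short trailing group, then mark empty fields as missing (None).
        values += [''] * (vars_per_account - len(values))
        accounts[name] = [v if v != '' else None for v in values]
    return customer_id, accounts


print(parse_record('customer_id,888,acct1,7000,,acct2,,70'))
print(parse_record('customer_id,999'))
```

A customer with no accounts, such as 999, simply yields an empty dictionary, and a missing variable shows up as `None` rather than shifting the remaining fields.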

Here is what the desired dataframe should look like:

customer_id,123,acct1,1000,10,acct2,2000,20,acct3,3000,30
customer_id,456,acct1,4000,40,acct2,5000,50
customer_id,789,acct3,6000,60
customer_id,888,acct1,7000,,acct2,,70
customer_id,999

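The dataframe above will have seven columns. It will be filled with NaN wherever an account does not exist or one of an account's variables is missing.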
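As a rough sketch of that seven-column layout, the snippet below builds the frame from the sample records held in a string. The column names ending in `_var1` and `_var2` are placeholders, since the question does not name the two per-account variables, and the fixed list of three accounts is an assumption taken from the sample.

```python
import numpy as np
import pandas as pd

SAMPLE = """\
customer_id,123,acct1,1000,10,acct2,2000,20,acct3,3000,30
customer_id,456,acct1,4000,40,acct2,5000,50
customer_id,789,acct3,6000,60
customer_id,888,acct1,7000,,acct2,,70
customer_id,999
"""

ACCOUNTS = ('acct1', 'acct2', 'acct3')  # assumed from the sample
COLUMNS = ['customer_id'] + [f'{a}_var{n}' for a in ACCOUNTS for n in (1, 2)]

rows = []
for line in SAMPLE.splitlines():
    fields = line.split(',')
    row = {'customer_id': fields[1]}
    # Walk the record in groups of three: account name, variable 1, variable 2.
    for i in range(2, len(fields), 3):
        name, var1, var2 = (fields[i:i + 3] + ['', ''])[:3]
        row[f'{name}_var1'] = var1
        row[f'{name}_var2'] = var2
    rows.append(row)

# Keys absent from a row become NaN; empty strings are replaced explicitly.
df = pd.DataFrame(rows, columns=COLUMNS).replace('', np.nan)
df = df.apply(pd.to_numeric)  # convert the remaining strings to numbers
print(df.shape)
print(df)
```

With the five sample records this gives a frame of shape (5, 7), in which customer 999 has NaN in all six account columns.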

Thanks!

1 Answer:

Answer 0: (score: 0)

You can get the desired result this way using pandas. You can optimize the code further for your use case.

import csv

import numpy as np
import pandas as pd

# Output columns: the customer ID plus two variables for each of the three accounts.
columns = ['customer_id', 'acct1_bal', 'acct_1_del', 'acct2_bal', 'acct_2_del', 'acct3_bal', 'acct_3_del']
# Map each account name to the pair of columns that holds its variables.
acct_columns = {
    'acct1': ('acct1_bal', 'acct_1_del'),
    'acct2': ('acct2_bal', 'acct_2_del'),
    'acct3': ('acct3_bal', 'acct_3_del'),
}

records = []
# newline='' is what the csv module expects for a file opened in text mode (Python 3).
with open('input_dataframe.csv', newline='') as csvfile:
    for row in csv.reader(csvfile):
        if not row:
            continue  # skip blank lines
        # row[0] is the literal 'customer_id' label and row[1] is the ID itself.
        record = {'customer_id': row[1]}
        # Each account occupies three fields: its name, then its two variables.
        # Looking the name up avoids relying on fixed positions, which break
        # when a customer lacks an account (for example, customer 789).
        for i in range(2, len(row), 3):
            fields = row[i:i + 3] + ['', '']
            bal_col, del_col = acct_columns[fields[0]]
            record[bal_col] = fields[1]
            record[del_col] = fields[2]
        records.append(record)

# Accounts absent from a record become NaN automatically; empty strings mark missing variables.
df = pd.DataFrame(records, columns=columns).replace('', np.nan)
df.to_csv('output_dataframe.csv')