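I am currently trying to parse a CSV from 4chan and read it into pandas, which is proving to be an extremely non-trivial task:
there are several different kinds of delimiters, and many column values are blank, so not every value should be filled.
My current code is as follows: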
import pandas as pd

# All column names in the file, in order
name_cols = ['num', 'subnum', 'thread_num', 'op', 'timestamp', 'timestamp_expired', 'preview_orig', 'preview_w',
             'preview_h', 'media_filename', 'media_w', 'media_h', 'media_size', 'media_hash', 'media_orig',
             'spoiler',
             'deleted', 'capcode', 'email', 'name', 'trip', 'title', 'comment', 'sticky', 'locked', 'poster_hash',
             'poster_country', 'exif'
             ]
# Subset of columns I actually want to keep
cols = ['num', 'timestamp', 'subnum', 'thread_num', 'email', 'name', 'title', 'comment', 'poster_country']
keep = "'"
col_d_types = {'num': object, 'timestamp': object,
               'subnum': object, 'thread_num': object,
               'email': object, 'name': object, 'title': object,
               'comment': object, 'poster_country': object
               }
df_chunk = pd.read_csv('pol.csv',
                       names=name_cols,
                       usecols=cols,
                       sep=r",\s+",  # raw string so \s is not treated as a string escape
                       skip_blank_lines=True,
                       na_values=['\\'],
                       engine='python',
                       on_bad_lines='skip',  # pandas >= 1.3; older versions used error_bad_lines=False
                       iterator=True,
                       dtype=col_d_types,  # kwarg is memory saver, low_memory=True *should* be deprecated
                       chunksize=1000)
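A rough illustration of how the separator pattern in the call above behaves, using plain `re.split` as a stand-in (this assumes the python engine splits each line on the regex in a similar way; the two sample strings are hypothetical and only shaped like the data). The pattern requires a comma followed by at least one whitespace character, so tightly packed delimiters such as `","` never match, while a comma inside ordinary prose does:

```python
import re

# Same pattern as the sep= argument in the read_csv call
SEP = r",\s+"

# Hypothetical lines: one with a comma + space inside comment text,
# one with only tightly packed field delimiters.
prose_line = "I'm Houston area. And thank you, that's helpful."
data_line = '"0","0",\\N,"",\\N'

prose_parts = re.split(SEP, prose_line)
data_parts = re.split(SEP, data_line)

print(prose_parts)  # split in two, at the comma inside the sentence
print(data_parts)   # not split at all: the whole line is one field
```

If that is what is happening, a line with no comma-plus-space sequence would arrive as a single field, which would be consistent with everything landing in the first column.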
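Here is the output from the raw CSV. As you can see, many of the values are wrapped in doubled quotation marks (""0"")
because some rows contain embedded newlines, and there are several runs of blank comma delimiters (,,,,,,,).
Currently the values are not actually being placed into their assigned columns; each row is just being put into the num
column as a single value, for example: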
num
"229000,I'm Houston area. And thank you,"that's actually very helpful."",""0"",""0"",\N,"""",\N"
As opposed to what I actually want:
num subnum thread_num timestamp email name title comment
229000 NaN ""0"" ""0"" \N """" \N I'm Houston area. And thank you,"that's actually very helpful.
Sample output (not very pretty; the word "blank" simply stands in for nouns that were covered up):
,num,subnum,thread_num,timestamp,email,name,title,comment,poster_country
229000,I'm Houston area. And thank you,"that's actually very helpful."",""0"",""0"",\N,"""",\N",,,,,,,
229001,"""23655119"",""0"",""23653143"",""0"",""1385858516"",""0"",\N,""0"",""0"",\N,""0"",""0"",""0"",\N,\N,""0"",""0"",""N"",\N,""Anonymous"",\N,\N,"">>23655073\",,,,,,,,
229002,"Your attempt to counter argue his \""evidence\"" with anecdotal evidence is even worse."",""0"",""0"",\N,""NZ"",\N",,,,,,,,
229003,"""23655118"",""0"",""23654317"",""0"",""1385858514"",""0"",\N,""0"",""0"",\N,""0"",""0"",""0"",\N,\N,""0"",""0"",""N"",\N,""Anonymous"",\N,\N,"">>23654682\",,,,,,,,
229004,He is actually right. Once you blank the the lead blank the rest follow the next in line. In which you can keep blank them in succession. \,,,,,,,,
229005,,,,,,,,,
229006,"captcha; cavarit arsenic"",""0"",""0"",\N,"""",\N",,,,,,,,
229007,"""23655123"",""0"",""23651207"",""0"",""1385858518"",""0"",\N,""0"",""0"",\N,""0"",""0"",""0"",\N,\N,""0"",""0"",""N"",\N,""Anonymous"",\N,\N,"">>23654818\",,,,,,,,
229008,"3. Blank said \""blank obey obey your blank\""\",,,,,,,,
229009,,,,,,,,,
229010,"and lets not forget that blank blank his own blank as a loophole for his own rules,\",,,,,,,,
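A possible reading of the doubled quotes in the sample: they may be an artifact of writing the mis-parsed frame back out, rather than something present in the original file. The sketch below assumes the sample was produced by a standard CSV writer with default settings, and uses a hypothetical row whose second cell holds a whole raw line:

```python
import csv
import io

# Hypothetical mis-parsed row: an index value, then an entire raw line
# (quotes, commas and \N included) sitting in a single cell.
row = ['229001', '"0","0",\\N']

buffer = io.StringIO()
csv.writer(buffer).writerow(row)  # default dialect: quote the cell, double inner quotes
exported = buffer.getvalue()

print(exported)
```

The cell is wrapped in one extra pair of quotes and every quote inside it is doubled, which resembles the `"""23655119"",""0"",...` pattern above. Under that assumption, the underlying file would contain ordinary `"0"` fields.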
If anyone has ever parsed this kind of unstructured content, your help would be greatly appreciated!
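One direction that might be worth trying, sketched with the standard library `csv` module rather than pandas. It rests on several assumptions that the sample suggests but does not confirm: fields are separated by plain commas and wrapped in double quotes, a quote inside a field is written as `\"`, a newline inside a field is written as a backslash followed by a real line break, and a bare `\N` means NULL. The two-line excerpt is hypothetical.

```python
import csv
import io
import re

# Hypothetical excerpt in the assumed format. On disk it would read:
#   "23655119","0",\N,"Anonymous",">>23654682\
#   He said \"hi\", ok","NZ",\N
raw = (
    '"23655119","0",\\N,"Anonymous",">>23654682\\\n'
    'He said \\"hi\\", ok","NZ",\\N\n'
)

# Turn a bare \N token into an empty field before parsing; otherwise the
# escape character would reduce it to a plain "N".
# Assumption: the sequence ,\N, never occurs inside a quoted comment.
BARE_NULL = re.compile(r'(^|,)\\N(?=,|$)', re.MULTILINE)
cleaned = BARE_NULL.sub(r'\1', raw)

reader = csv.reader(
    io.StringIO(cleaned),
    delimiter=',',
    quotechar='"',
    escapechar='\\',    # \" becomes a quote, \<newline> becomes a newline
    doublequote=False,  # quotes are backslash-escaped, not doubled
)

# Empty strings become None; note this also maps a quoted "" to None.
rows = [[field if field != '' else None for field in row] for row in reader]

for row in rows:
    print(row)
```

Both physical lines come back as one record of seven fields, with the comment spanning the line break. `pandas.read_csv` also accepts `quotechar` and `escapechar`, so the same settings may carry over, but that is not tested here.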