Reading a highly unstructured csv file with pandas

Time: 2019-10-09 14:18:50

Tags: python-3.x pandas

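I am currently trying to parse a csv from 4chan and read it into pandas, which is proving to be an extremely non-trivial task:

There are several different kinds of delimiters, and many column values are blank, so not every column is expected to hold a value.

My current code is as follows: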

import pandas as pd

# Full column layout of the file (28 columns).
name_cols = ['num', 'subnum', 'thread_num', 'op', 'timestamp', 'timestamp_expired', 'preview_orig', 'preview_w',
             'preview_h', 'media_filename', 'media_w', 'media_h', 'media_size', 'media_hash', 'media_orig',
             'spoiler',
             'deleted', 'capcode', 'email', 'name', 'trip', 'title', 'comment', 'sticky', 'locked', 'poster_hash',
             'poster_country', 'exif'
             ]

# Columns I actually want to keep.
cols = ['num', 'timestamp', 'subnum', 'thread_num', 'email', 'name', 'title', 'comment', 'poster_country']

keep = "'"  # not used by the read_csv call below

# Read everything as strings.
col_d_types = {'num': object, 'timestamp': object,
               'subnum': object, 'thread_num': object,
               'email': object, 'name': object, 'title': object,
               'comment': object, 'poster_country': object
               }

df_chunk = pd.read_csv('pol.csv',
                       names=name_cols,
                       usecols=cols,
                       sep=r",\s+",  # regex separator: a comma followed by whitespace
                       skip_blank_lines=True,
                       na_values=['\\'],
                       engine='python',
                       error_bad_lines=False,  # pandas >= 1.3 spells this on_bad_lines='skip'
                       iterator=True,
                       dtype=col_d_types,  # kwarg is memory saver, low_memory=True *should* be deprecated
                       chunksize=1000)
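
A side note on the sep argument in the call above, sketched with the standard re module. The pattern ,\s+ only matches a comma that is followed by whitespace, and the pandas documentation warns that regex separators are prone to ignoring quoted data. The line below is a hypothetical raw row modeled on the samples in this question, not a line copied from pol.csv.

```python
import re

# Hypothetical raw line, modeled on the rows shown in this question.
line = '"23655119","0","23653143",\\N,"Anonymous",">>23654682"'

by_regex = re.split(r",\s+", line)  # same pattern as sep in the call above
by_comma = line.split(",")          # plain comma, for comparison only

print(len(by_regex), len(by_comma))
```

With no whitespace after any comma, the regex never matches and the whole line stays in one piece, which would be consistent with every row ending up in a single column. The plain split is only there for contrast; it ignores quoting too, so it is not a fix in itself.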

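Below is the output I get from the raw csv. As you can see, many values end up wrapped in doubled quotes (""0"") because some rows contain embedded newlines, and several rows consist of runs of empty comma delimiters (,,,,,,,). The values are not being placed into their assigned columns; instead each row is put into the num column as a single value, for example: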

num
"229000,I'm Houston area. And thank you,"that's actually very helpful."",""0"",""0"",\N,"""",\N"

As opposed to what I actually want:

num     subnum   thread_num   timestamp   email   name   title   comment        
229000  NaN      ""0""        ""0""       \N      """"   \N      I'm Houston area. And thank you,"that's actually very helpful.
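
One way to get from the raw text to named columns, sketched under an assumption: judging from the rows in this question, pol.csv looks like a MySQL-style text export, with every field wrapped in double quotes, backslash as the escape character, \N for NULL, and comment text allowed to continue onto the next physical line. None of that is confirmed here. parse_records and NAME_COLS are names made up for this sketch, and the sample record is reassembled from rows 229001 and 229002 of the sample output in this question.

```python
NAME_COLS = [
    'num', 'subnum', 'thread_num', 'op', 'timestamp', 'timestamp_expired',
    'preview_orig', 'preview_w', 'preview_h', 'media_filename', 'media_w',
    'media_h', 'media_size', 'media_hash', 'media_orig', 'spoiler', 'deleted',
    'capcode', 'email', 'name', 'trip', 'title', 'comment', 'sticky', 'locked',
    'poster_hash', 'poster_country', 'exif',
]


def parse_records(text):
    """Yield one list of fields per record; an unquoted \\N becomes None."""
    fields, buf = [], []
    in_quotes = was_quoted = False

    def end_field():
        nonlocal buf, was_quoted
        value = "".join(buf)
        fields.append(None if value == "\\N" and not was_quoted else value)
        buf, was_quoted = [], False

    i = 0
    while i < len(text):
        ch = text[i]
        if in_quotes:
            if ch == "\\" and i + 1 < len(text):
                i += 1
                buf.append(text[i])      # keep the escaped character literally
            elif ch == '"':
                in_quotes = False        # closing quote
            else:
                buf.append(ch)           # includes raw newlines inside quotes
        elif ch == '"':
            in_quotes = was_quoted = True
        elif ch == ",":
            end_field()
        elif ch == "\n":
            end_field()                  # unquoted newline ends the record
            yield fields
            fields = []
        else:
            buf.append(ch)
        i += 1
    if buf or fields or was_quoted:      # last record without a trailing newline
        end_field()
        yield fields


# Assumed raw form of one record (two physical lines, 28 fields).
raw = (
    '"23655119","0","23653143","0","1385858516","0",\\N,"0","0",\\N,"0","0",'
    '"0",\\N,\\N,"0","0","N",\\N,"Anonymous",\\N,\\N,">>23655073\\\n'
    'Your attempt to counter argue his \\"evidence\\" with anecdotal evidence '
    'is even worse.","0","0",\\N,"NZ",\\N\n'
)

records = list(parse_records(raw))
row = dict(zip(NAME_COLS, records[0]))
print(len(records[0]), row['num'], row['name'], row['poster_country'])
```

If the assumption holds, pd.DataFrame(records, columns=NAME_COLS) followed by a column selection should give the layout shown above, but this is untested against the real file, and the sketch does not handle \r\n line endings or blank lines.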

Sample output (not very pretty); the word "blank" just covers up certain nouns:

,num,subnum,thread_num,timestamp,email,name,title,comment,poster_country
229000,I'm Houston area. And thank you,"that's actually very helpful."",""0"",""0"",\N,"""",\N",,,,,,,
229001,"""23655119"",""0"",""23653143"",""0"",""1385858516"",""0"",\N,""0"",""0"",\N,""0"",""0"",""0"",\N,\N,""0"",""0"",""N"",\N,""Anonymous"",\N,\N,"">>23655073\",,,,,,,,
229002,"Your attempt to counter argue his \""evidence\"" with anecdotal evidence is even worse."",""0"",""0"",\N,""NZ"",\N",,,,,,,,
229003,"""23655118"",""0"",""23654317"",""0"",""1385858514"",""0"",\N,""0"",""0"",\N,""0"",""0"",""0"",\N,\N,""0"",""0"",""N"",\N,""Anonymous"",\N,\N,"">>23654682\",,,,,,,,
229004,He is actually right. Once you blank the the lead blank the rest follow the next in line. In which you can keep blank them in succession. \,,,,,,,,
229005,,,,,,,,,
229006,"captcha; cavarit arsenic"",""0"",""0"",\N,"""",\N",,,,,,,,
229007,"""23655123"",""0"",""23651207"",""0"",""1385858518"",""0"",\N,""0"",""0"",\N,""0"",""0"",""0"",\N,\N,""0"",""0"",""N"",\N,""Anonymous"",\N,\N,"">>23654818\",,,,,,,,
229008,"3. Blank said \""blank obey obey your blank\""\",,,,,,,,
229009,,,,,,,,,
229010,"and lets not forget that blank blank his own blank as a loophole for his own rules,\",,,,,,,,
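
The doubled quotes in the sample output can be undone with the standard csv module, since doubling a quote is how a CSV writer escapes it inside a quoted field. The sketch below takes row 229006 from the sample output and reads it back with the default dialect. The variable names are only for illustration.

```python
import csv
import io

# Row 229006 from the sample output: index, then the 'num' cell, then blanks.
exported = '229006,"captcha; cavarit arsenic"",""0"",""0"",\\N,"""",\\N",,,,,,,,\n'

index, num_cell, *rest = next(csv.reader(io.StringIO(exported, newline='')))
print(index)
print(num_cell)
```

The recovered num cell is a fragment of a raw line: the tail of a comment followed by quoted fields and \N markers. That suggests the original file uses single (not doubled) quotes around its fields, and that the doubling was introduced when the mis-parsed frame was written back out.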

If anyone has ever parsed this kind of unstructured content, your help would be greatly appreciated!

0 answers:

No answers