I'm having trouble reading accurate information out of archived 4chan comments. Since the thread structure of 4chan threads doesn't (seem to) translate well into a rectangular dataframe, I'm stuck on how to actually get each thread's corresponding comments into a single row in pandas.
To compound the problem, the dataset is 54 GB in size. I asked a similar question about how to read the data into a pandas dataframe (the solution to which made me realize this issue), which makes diagnosing every problem tedious.
The code I'm using to read in part of the data is as follows:
import pandas as pd

def Four_pleb_chunker():
    """
    :return: 4pleb data is over 54 GB so this chunks it into something manageable
    """
    with open('pol.csv') as f:
        with open('pol_part.csv', 'w') as g:
            for i in range(1000):
                g.write(f.readline())
    name_cols = ['num', 'subnum', 'thread_num', 'op', 'timestamp', 'timestamp_expired', 'preview_orig', 'preview_w', 'preview_h',
                 'media_filename', 'media_w', 'media_h', 'media_size', 'media_hash', 'media_orig', 'spoiler', 'deleted', 'capcode',
                 'email', 'name', 'trip', 'title', 'comment', 'sticky', 'locked', 'poster_hash', 'poster_country', 'exif']
    cols = ['num', 'timestamp', 'email', 'name', 'title', 'comment', 'poster_country']
    df_chunk = pd.read_csv('pol_part.csv',
                           names=name_cols,
                           delimiter=None,
                           usecols=cols,
                           skip_blank_lines=True,
                           engine='python',
                           error_bad_lines=False)
    df_chunk = df_chunk.rename(columns={"comment": "Comments"})
    df_chunk = df_chunk.dropna(subset=['Comments'])
    df_chunk['Comments'] = df_chunk['Comments'].str.replace('[^0-9a-zA-Z]+', ' ')
    df_chunk.to_csv('pol_part_df.csv')
    return df_chunk
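For context, pandas can also stream the full file in pieces without the intermediate pol_part.csv. The sketch below only illustrates that option; the tab separator and the chunk size of 100,000 rows are my assumptions, not values taken from the dataset:

import pandas as pd

# Sketch only: let read_csv stream 'pol.csv' in chunks instead of copying
# the first 1000 lines to a temporary file. sep='\t' and chunksize are
# assumptions; name_cols/cols are the same lists used in the function above.
name_cols = ['num', 'subnum', 'thread_num', 'op', 'timestamp', 'timestamp_expired', 'preview_orig', 'preview_w', 'preview_h',
             'media_filename', 'media_w', 'media_h', 'media_size', 'media_hash', 'media_orig', 'spoiler', 'deleted', 'capcode',
             'email', 'name', 'trip', 'title', 'comment', 'sticky', 'locked', 'poster_hash', 'poster_country', 'exif']
cols = ['num', 'timestamp', 'email', 'name', 'title', 'comment', 'poster_country']

reader = pd.read_csv('pol.csv',
                     names=name_cols,
                     usecols=cols,
                     sep='\t',
                     engine='python',
                     error_bad_lines=False,
                     chunksize=100000)

for df_chunk in reader:
    df_chunk = df_chunk.dropna(subset=['comment'])
    # process or accumulate each chunk here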
The chunker above works as intended, but because of the structure of each thread, the parser I wrote sometimes returns nonsensical results. In csv format, this is what the first few rows of the dataset look like (screenshot, since it's hard to write all of those rows out in this UI).
As can be seen, the comments within each thread are separated by '\', but each comment does not get its own row. My goal is to at least get each comment onto its own row so that I can parse it correctly. However, the function I'm using to parse the data cuts off after 1000 iterations whether or not it's at a new line.
Fundamentally, my question is: how can I structure this data so that the comments are actually read accurately, and so that I can read in a complete sample dataframe rather than a truncated one? As for the solutions I've tried:
df_chunk = pd.read_csv('pol_part.csv',
                       names=name_cols,
                       delimiter='',
                       usecols=cols,
                       skip_blank_lines=True,
                       engine='python',
                       error_bad_lines=False)
If I remove/change the delimiter parameter, I get this error:
Skipping line 31473: ',' expected after '"'
Which makes sense, since the data isn't delimited by ',', so it skips every line that doesn't meet that criterion, which in this case is the whole dataframe. Entering \ for the argument gives me a syntax error. I'm at a loss as to what to do next, so if anyone has experience dealing with this kind of problem you'd be a lifesaver. Let me know if I haven't included anything in here and I'll update the post.
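(For reference, the syntax error from a bare \ is Python string escaping rather than anything pandas-specific: a lone backslash escapes the closing quote. A quick sketch of the escaped form, with no claim that a backslash delimiter actually parses this file correctly:)

# A bare backslash escapes the closing quote, so this is a SyntaxError:
#     delimiter='\'
# Escaping it (or using a raw string) is at least valid Python:
df_chunk = pd.read_csv('pol_part.csv',
                       names=name_cols,
                       delimiter='\\',      # equivalently r'\'
                       usecols=cols,
                       engine='python',
                       error_bad_lines=False)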
Update: here are some sample rows from the CSV for testing:
2 23594708 1385716767 \N Anonymous \N Example: not identifying the fundamental scarcity of resources which underlies the entire global power structure, or the huge, documented suppression of any threats to that via National Security Orders. Or that EVERY left/right ideology would be horrible in comparison to ANY in which energy scarcity and the hierarchical power structures dependent upon it had been addressed.
3 23594754 1385716903 \N Anonymous \N ">>23594701\
\
No, /pol/ is bait. That's the point."
4 23594773 1385716983 \N Anonymous \N ">>23594754
\
Being a non-bait among baits is equal to being a bait among non-baits."
5 23594795 1385717052 \N Anonymous \N Don't forget how heavily censored this board is! And nobody has any issues with that.
6 23594812 1385717101 \N Anonymous \N ">>23594773\
\
Clever. The effect is similar. But there are minds on /pol/ who don't WANT to be bait, at least."
Answer (score: 1)
Here's an example script that will convert your csv into a separate row for each comment:
import csv

# open file for output and create csv writer
f_out = open('out.csv', 'w')
w = csv.writer(f_out)

# open input file and create reader
with open('test.csv') as f:
    r = csv.reader(f, delimiter='\t')
    for l in r:
        # skip empty lines
        if not l:
            continue
        # split the last field (the comment) on backslash + newline
        # and loop over each resulting string
        for s in l[-1].split('\\\n'):
            # copy all fields except the last one
            output = l[:-1]
            # add a single comment
            output.append(s)
            w.writerow(output)

# flush and close the output file
f_out.close()
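Once that has run, out.csv should have one comment per row, so pandas can read it back in with its defaults. A minimal sketch of that step; the column names are left generic here because the test rows above only contain a subset of the full 28 columns:

import pandas as pd

# Sketch: csv.writer wrote comma-separated, quoted rows with no header,
# so read_csv's defaults plus header=None should apply.
df = pd.read_csv('out.csv', header=None)

# After the split/copy above, the comment is always the last column.
df = df.rename(columns={df.columns[-1]: 'Comments'})
print(df['Comments'].head())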