Computing 34 million rows, my JupyterLab kernel is dead

Time: 2019-04-16 04:18:25

Tags: python bigdata jupyter

I have 34 million rows and only one column. I want to split the string values in the dataframe into several columns, but it takes a very long time and my JupyterLab kernel dies while processing it.

Here is my sample dataset (df):

Log
Apr  4 20:30:33 100.51.100.254 dns,packet user: --- got query from 10.5.14.243:30648:
Apr  4 20:30:33 100.51.100.254 dns,packet user: id:78a4 rd:1 tc:0 aa:0 qr:0 ra:0 QUERY 'no error'
Apr  4 20:30:33 100.51.100.254 dns,packet user: question: tracking.intl.miui.com:A:IN
Apr  4 20:30:33 dns user: query from 9.5.10.243: #4746190 tracking.intl.miui.com. A

I want to split it into four columns using the following code:

# split each line on whitespace into at most four fields:
# month, day, time, and the rest of the message
df1 = df['Log'].str.split(n=3, expand=True)
df1.columns = ['Month', 'Date', 'Time', 'Log']
df1.head()

This is the result I expect:

     Month Date      Time                                              Log
0      Apr    4  20:30:33  100.51.100.254 dns,packet user: --- go...
1      Apr    4  20:30:33  100.51.100.254 dns,packet user: id:78a...
2      Apr    4  20:30:33  100.51.100.254 dns,packet user: questi...
3      Apr    4  20:30:33  dns transjakarta: query from 9.5.10.243: #474...
4      Apr    4  20:30:33  100.51.100.254 dns,packet user: --- se...

It works when I use a sample of 800k rows. However, when I apply it to the full data, the kernel dies.
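
My guess is that str.split with expand=True has to hold the split pieces for all 34 million rows in memory at once. Would processing the frame in slices help? Something like this sketch (untested at full scale; the chunk size is a guess to tune):

import pandas as pd

chunk_size = 1_000_000  # guess; tune to available memory
parts = []
for start in range(0, len(df), chunk_size):
    # split one slice at a time, so only one chunk's worth of
    # intermediate split results is alive at any moment
    part = df['Log'].iloc[start:start + chunk_size].str.split(n=3, expand=True)
    part.columns = ['Month', 'Date', 'Time', 'Log']
    parts.append(part)
# note: this final concat still materializes the full 4-column frame,
# so it only trims the peak memory used by intermediates
df1 = pd.concat(parts, ignore_index=True)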

Is there a solution? Maybe using Dask or swifter?
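
If Dask is the way to go, would something like this per-partition version work? (A rough sketch, untested; the partition count is a guess, and meta just declares the expected output columns.)

import dask.dataframe as dd

def split_log(pdf):
    # the same split as above, applied to one pandas partition
    out = pdf['Log'].str.split(n=3, expand=True)
    out.columns = ['Month', 'Date', 'Time', 'Log']
    return out

ddf = dd.from_pandas(df, npartitions=64)  # partition count is a guess
meta = {'Month': 'object', 'Date': 'object', 'Time': 'object', 'Log': 'object'}
# compute() pulls everything back into one pandas frame; writing the
# result to parquet instead would avoid holding all 34M rows at once
df1 = ddf.map_partitions(split_log, meta=meta).compute()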

0 Answers
