Handling a very large DataFrame

Time: 2016-03-06 16:38:10

Tags: python pandas bigdata

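Right now I'm having trouble working out how to process my data and convert it into a DataFrame. Basically, what I'm trying to do is read the data first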

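# Column names are assumed here; list them in the order they appear in the log file.
data = pd.read_csv(querylog, sep=" ", header=None, names=["IP", "Query"])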

Then group it

query_group = data.groupby('Query')
ip_group = data.groupby('IP')

Finally, create an empty DataFrame to map the values into

df = pd.DataFrame(columns=query_group.groups, index=range(0, len(ip_group.groups)))

index = 0
for name, group in ip_group:
    df.set_value(index, 'IP', name)
    index += 1
df = df.set_index('IP')

for index, row in data.iterrows():
    df.set_value(row['IP'], row['Query'], 1)
    print(index)
df = df.fillna(0)

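So my problem is that ip_group can reach 6,000 entries and query_group up to 400,000, which produces a very large empty DataFrame that my memory can't hold. Can anyone help me solve this? Any help is appreciated.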
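
For a rough sense of scale, the arithmetic behind the memory problem can be sketched as below. The counts 6,000 and 400,000 come from the question; the 8 bytes per cell is an assumption (it matches float64 values or 64-bit object pointers), and the variable names are purely illustrative.

```python
# Back-of-the-envelope size of a dense IP x Query table.
n_ips = 6000          # upper bound on distinct IPs, from the question
n_queries = 400000    # upper bound on distinct queries, from the question

cells = n_ips * n_queries
bytes_per_cell = 8    # assumption: float64 or a 64-bit object pointer

dense_bytes = cells * bytes_per_cell
dense_gb = dense_bytes / 1e9

print(cells)      # 2400000000
print(dense_gb)   # 19.2
```

Under that assumption the empty frame alone would need roughly 19 GB before any value is written.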

A sample DataFrame of the data looks like this

data = pd.DataFrame( { "Query" : ["google.com", "youtube.com", "facebook.com"],
     "IP" : ["192.168.0.104", "192.168.0.103","192.168.0.104"] } )

My expected output looks like this

                google.com youtube.com  facebook.com
IP            
192.168.0.104   1          0             1
192.168.0.103   0          1             0

1 answer:

Answer 0 (score: 0)

IIUC you can use get_dummies, but without the data it is hard to find the best solution:

df = pd.get_dummies(data.set_index('IP')['Query'])
print(df.groupby(df.index).sum())

Sample:

import pandas as pd

data = pd.DataFrame( { "Query" : ["a", "b", "c", "d", "a" , "b"],
     "IP" : [1,5,4,8,3,4] } )
print(data)
   IP Query
0   1     a
1   5     b
2   4     c
3   8     d
4   3     a
5   4     b
#set index from column data
data = data.set_index('IP')

#get dummies from column Query
df = pd.get_dummies(data['Query'])
print(df)
    a  b  c  d
IP            
1   1  0  0  0
5   0  1  0  0
4   0  0  1  0
8   0  0  0  1
3   1  0  0  0
4   0  1  0  0

#groupby by index and sum columns
print(df.groupby(df.index).sum())
    a  b  c  d
IP            
1   1  0  0  0
3   1  0  0  0
4   0  1  1  0
5   0  1  0  0
8   0  0  0  1
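
Applied to the sample data from the question, the same idea might look like the sketch below. It swaps `sum()` for `max()`, which is a variation rather than what the answer above does: `sum()` counts how often an IP issued a query, while `max()` keeps the 0/1 indicators shown in the expected output. The `dtype` argument of `get_dummies` assumes a reasonably recent pandas (0.23 or later).

```python
import pandas as pd

data = pd.DataFrame({
    "Query": ["google.com", "youtube.com", "facebook.com"],
    "IP": ["192.168.0.104", "192.168.0.103", "192.168.0.104"],
})

# One indicator column per distinct query, one row per log line.
dummies = pd.get_dummies(data.set_index("IP")["Query"], dtype="int8")

# Collapse the rows that share an IP; max() keeps the values at 0 or 1.
result = dummies.groupby(level=0).max()
print(result)
```

Note that this still builds a dense table, so it only helps if the intermediate `dummies` frame fits in memory.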

Try converting to int8 with astype; in the listings below this cuts the reported memory usage roughly 3x:

print(pd.get_dummies(data['Query']).info())
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, 192.168.0.104 to 192.168.0.104
Data columns (total 3 columns):
facebook.com    3 non-null float64
google.com      3 non-null float64
youtube.com     3 non-null float64
dtypes: float64(3)
memory usage: 96.0+ bytes

import numpy as np

print(pd.get_dummies(data['Query']).astype(np.int8).info())
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, 192.168.0.104 to 192.168.0.104
Data columns (total 3 columns):
facebook.com    3 non-null int8
google.com      3 non-null int8
youtube.com     3 non-null int8
dtypes: int8(3)
memory usage: 33.0+ bytes
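
The totals printed by `info()` include the index, which is why the saving shows up as roughly 3x (96 vs 33 bytes) rather than 8x. A small sketch that isolates the column data, using a hypothetical 3x3 frame of zeros in place of the real dummies:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 3x3 dummy table above.
columns = ["facebook.com", "google.com", "youtube.com"]
as_float = pd.DataFrame(np.zeros((3, 3)), columns=columns)  # float64
as_int8 = as_float.astype(np.int8)

# index=False leaves the index out, so only the cell data is counted.
float_bytes = int(as_float.memory_usage(index=False).sum())
int8_bytes = int(as_int8.memory_usage(index=False).sum())

print(float_bytes)  # 72 = 9 cells * 8 bytes
print(int8_bytes)   # 9  = 9 cells * 1 byte
```

The remaining 24 bytes in each total above are consistent with the three-entry index, so on a wide table the per-cell saving should approach 8x.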

print(pd.get_dummies(data['Query'], sparse=True).info())
<class 'pandas.sparse.frame.SparseDataFrame'>
Index: 3 entries, 192.168.0.104 to 192.168.0.104
Data columns (total 3 columns):
facebook.com    3 non-null float64
google.com      3 non-null float64
youtube.com     3 non-null float64
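
With up to 6,000 IPs and 400,000 queries, another option is to skip the dense table entirely and build a SciPy sparse matrix. This is a different technique from the `get_dummies` calls above, not part of the original answer, and it assumes SciPy is installed; the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from scipy import sparse

data = pd.DataFrame({
    "Query": ["google.com", "youtube.com", "facebook.com"],
    "IP": ["192.168.0.104", "192.168.0.103", "192.168.0.104"],
})

# Keep each (IP, Query) pair once so every stored cell is exactly 1.
pairs = data[["IP", "Query"]].drop_duplicates()

# factorize assigns an integer code to each distinct value,
# in order of first appearance.
ip_codes, ip_labels = pd.factorize(pairs["IP"])
query_codes, query_labels = pd.factorize(pairs["Query"])

# Only the non-zero cells are stored, one entry per distinct pair.
matrix = sparse.coo_matrix(
    (np.ones(len(pairs), dtype=np.int8), (ip_codes, query_codes)),
    shape=(len(ip_labels), len(query_labels)),
).tocsr()

print(list(ip_labels))     # row labels
print(list(query_labels))  # column labels
print(matrix.toarray())    # dense view, only sensible for a tiny sample
```

Row `i` of `matrix` corresponds to `ip_labels[i]` and column `j` to `query_labels[j]`, so memory grows with the number of distinct (IP, Query) pairs rather than with 6,000 x 400,000.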