I'm now having trouble processing the data and converting it into a DataFrame. Basically, I first read the data:
data = pd.read_csv(querylog, sep=" ", header=None)
then group it:
query_group = data.groupby('Query')
ip_group = data.groupby('IP')
and finally create an empty DataFrame to map the values into:
df = pd.DataFrame(columns=query_group.groups, index=range(0, len(ip_group.groups)))
index = 0
for name, group in ip_group:
    df.set_value(index, 'IP', name)
    index += 1
df = df.set_index('IP')
for index, row in data.iterrows():
    df.set_value(row['IP'], row['Query'], 1)
    print(index)
df = df.fillna(0)
My problem is that ip_group can reach 6,000 entries and query_group up to 400,000, so the blank DataFrame becomes too large to fit in memory. Can anyone help me solve this? Any help is appreciated.
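For a sense of scale, here is a rough back-of-the-envelope estimate of the dense table size. It assumes 8 bytes per cell (as for float64) versus 1 byte per cell (as for int8) and ignores index and column-label overhead, so the real figures would be somewhat higher:

```python
# Rough size of a dense 6,000 x 400,000 table at different cell widths.
n_ips = 6_000
n_queries = 400_000
cells = n_ips * n_queries  # 2,400,000,000 cells


def dense_size_gb(bytes_per_cell):
    """Return the dense table size in GB (10**9 bytes), labels excluded."""
    return cells * bytes_per_cell / 1e9


float64_gb = dense_size_gb(8)  # assumption: 8 bytes per cell
int8_gb = dense_size_gb(1)     # assumption: 1 byte per cell
print(f"float64: {float64_gb:.1f} GB, int8: {int8_gb:.1f} GB")
# prints: float64: 19.2 GB, int8: 2.4 GB
```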
A sample DataFrame of the data looks like this:
data = pd.DataFrame( { "Query" : ["google.com", "youtube.com", "facebook.com"],
"IP" : ["192.168.0.104", "192.168.0.103","192.168.0.104"] } )
My expected output looks like this:
               google.com  youtube.com  facebook.com
IP
192.168.0.104           1            0             1
192.168.0.103           0            1             0
Answer 0 (score: 0)
IIUC, you can use get_dummies, but without the actual data it is hard to find the best solution:
df = pd.get_dummies(data.set_index('IP')['Query'])
print(df.groupby(df.index).sum())
Sample:
import pandas as pd
data = pd.DataFrame( { "Query" : ["a", "b", "c", "d", "a" , "b"],
"IP" : [1,5,4,8,3,4] } )
print(data)
   IP Query
0   1     a
1   5     b
2   4     c
3   8     d
4   3     a
5   4     b
#set index from column data
data = data.set_index('IP')
#get dummies from column Query
df = pd.get_dummies(data['Query'])
print(df)
    a  b  c  d
IP
1   1  0  0  0
5   0  1  0  0
4   0  0  1  0
8   0  0  0  1
3   1  0  0  0
4   0  1  0  0
#groupby by index and sum columns
print(df.groupby(df.index).sum())
    a  b  c  d
IP
1   1  0  0  0
3   1  0  0  0
4   0  1  1  0
5   0  1  0  0
8   0  0  0  1
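Note that sum() counts repeated (IP, Query) pairs, so a query logged twice for the same IP would show 2 rather than 1. If only a 0/1 flag is wanted, as in the expected output of the question, one option is to aggregate with max() instead. A minimal sketch, assuming a duplicated row that is not part of the original sample:

```python
import pandas as pd

# Question's sample plus one hypothetical repeated visit (last row).
data = pd.DataFrame({
    "Query": ["google.com", "youtube.com", "facebook.com", "google.com"],
    "IP": ["192.168.0.104", "192.168.0.103", "192.168.0.104", "192.168.0.104"],
})

# One indicator column per distinct query, one row per log line.
dummies = pd.get_dummies(data.set_index("IP")["Query"])

# max() per IP: any number of repeated visits still yields a single 1.
flags = dummies.groupby(level=0).max().astype("int8")
print(flags)
```

With sum() in place of max(), the google.com cell for 192.168.0.104 would be 2 in this sample.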
Try converting to int8 with astype, which cuts memory usage to roughly a third in this example (96 bytes vs. 33 bytes below):
pd.get_dummies(data['Query']).info()
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, 192.168.0.104 to 192.168.0.104
Data columns (total 3 columns):
facebook.com 3 non-null float64
google.com 3 non-null float64
youtube.com 3 non-null float64
dtypes: float64(3)
memory usage: 96.0+ bytes
import numpy as np
pd.get_dummies(data['Query']).astype(np.int8).info()
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, 192.168.0.104 to 192.168.0.104
Data columns (total 3 columns):
facebook.com 3 non-null int8
google.com 3 non-null int8
youtube.com 3 non-null int8
dtypes: int8(3)
memory usage: 33.0+ bytes
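The totals reported by info() include the index, which is why the saving looks like about 3x on this tiny frame. Measuring only the cell data shows the full 8x difference between float64 and int8. A small sketch, assuming the question's three-row sample with IP as the index; the explicit cast to float64 is there only to reproduce the dtype shown above, since newer pandas versions return a different default dtype from get_dummies:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "Query": ["google.com", "youtube.com", "facebook.com"],
    "IP": ["192.168.0.104", "192.168.0.103", "192.168.0.104"],
}).set_index("IP")

# Cast explicitly so the comparison does not depend on the pandas version.
as_float = pd.get_dummies(data["Query"]).astype(np.float64)
as_int8 = as_float.astype(np.int8)

# index=False counts only the column data, not the IP labels.
float_bytes = as_float.memory_usage(index=False).sum()
int8_bytes = as_int8.memory_usage(index=False).sum()
print(float_bytes, int8_bytes)  # prints: 72 9
```

On a table with hundreds of thousands of columns the cell data dominates, so the effective saving should be closer to 8x than 3x.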
pd.get_dummies(data['Query'], sparse=True).info()
<class 'pandas.sparse.frame.SparseDataFrame'>
Index: 3 entries, 192.168.0.104 to 192.168.0.104
Data columns (total 3 columns):
facebook.com 3 non-null float64
google.com 3 non-null float64
youtube.com 3 non-null float64