Python running out of memory

Time: 2016-07-28 15:32:38

Tags: python memory scikit-learn

I have the following program. When I run it, I get a MemoryError, specifically at Fpred = F.predict(A) (see below).

import json
data = []
with open('yelp_data.json') as f:
    for line in f:
        data.append(json.loads(line))
star = []
for i in range(len(data)):
    star.append(data[i].values()[10])

attributes = []
for i in range(len(data)):
    attributes.append(data[i].values()[12])


def flatten_dict(dd, separator=' ', prefix=''):
    return { prefix + separator + k if prefix else k : v
         for kk, vv in dd.items()
         for k, v in flatten_dict(vv, separator, kk).items()
         } if isinstance(dd, dict) else { prefix : dd }

flatten_attr = list(flatten_dict(attributes[i], separator = ' ', prefix = '') for i in range(len(attributes)))


from sklearn.feature_extraction import DictVectorizer
v = DictVectorizer(sparse = False)
X = v.fit_transform(flatten_attr)

from sklearn.feature_extraction.text import TfidfTransformer
Transformer = TfidfTransformer()
A = Transformer.fit_transform(X)

from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import train_test_split

from sklearn.neighbors import KNeighborsRegressor
from sklearn.cross_validation import KFold

F = KNeighborsRegressor(n_neighbors = 27)

Ffit = F.fit(A, star)
Fpred = F.predict(A)
Score = F.score(A, star)
print(Score)

My JSON file looks like this:

{"business_id": "vcNAWiLM4dR7D2nwwJ7nCA", "full_address": "4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018", "hours": {"Tuesday": {"close": "17:00", "open": "08:00"}, "Friday": {"close": "17:00", "open": "08:00"}, "Monday": {"close": "17:00", "open": "08:00"}, "Wednesday": {"close": "17:00", "open": "08:00"}, "Thursday": {"close": "17:00", "open": "08:00"}}, "open": true, "categories": ["Doctors", "Health & Medical"], "city": "Phoenix", "review_count": 7, "name": "Eric Goldberg, MD", "neighborhoods": [], "longitude": -111.98375799999999, "state": "AZ", "stars": 3.5, "latitude": 33.499313000000001, "attributes": {"By Appointment Only": true}, "type": "business"}
{"business_id": "JwUE5GmEO-sH1FuwJgKBlQ", "full_address": "6162 US Highway 51\nDe Forest, WI 53532", "hours": {}, "open": true, "categories": ["Restaurants"], "city": "De Forest", "review_count": 26, "name": "Pine Cone Restaurant", "neighborhoods": [], "longitude": -89.335843999999994, "state": "WI", "stars": 4.0, "latitude": 43.238892999999997, "attributes": {"Take-out": true, "Good For": {"dessert": false, "latenight": false, "lunch": true, "dinner": false, "breakfast": false, "brunch": false}, "Caters": false, "Noise Level": "average", "Takes Reservations": false, "Delivery": false, "Ambience": {"romantic": false, "intimate": false, "touristy": false, "hipster": false, "divey": false, "classy": false, "trendy": false, "upscale": false, "casual": false}, "Parking": {"garage": false, "street": false, "validated": false, "lot": true, "valet": false}, "Has TV": true, "Outdoor Seating": false, "Attire": "casual", "Alcohol": "none", "Waiter Service": true, "Accepts Credit Cards": true, "Good for Kids": true, "Good For Groups": true, "Price Range": 1}, "type": "business"}

$ ls -l yelp_data.json

shows a file size of 33524921 bytes.

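As a last resort, I could extract the required data into a separate file and import that into this program. What would be a good way to improve this program so that it runs more efficiently? Thanks!!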

1 answer:

Answer 0 (score: 0)

This is not related to performance or memory, but you can replace:

for i in range(len(data)):
    star.append(data[i].values()[10])

with:

for item in data:
    star.append(item.values()[10])

data is a list, which is iterable. https://docs.python.org/3/library/stdtypes.html#list

Also, in Python 3, indexing dict values no longer works, and you will end up with:

    star.append(data[i].values()[10])
TypeError: 'dict_values' object does not support indexing
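
As a small illustration of the error above (a sketch using a made-up record, not the asker's data): in Python 3, dict.values() returns a view that cannot be indexed. Wrapping it in list() makes indexing possible again, although the position then depends on the order of the keys. The example assumes Python 3.7 or later, where dicts keep insertion order.

```python
# Hypothetical record, loosely modelled on the Yelp sample in the question.
record = {"city": "Phoenix", "review_count": 7, "stars": 3.5}

try:
    record.values()[2]          # Python 3: dict views do not support indexing
    indexing_failed = False
except TypeError:
    indexing_failed = True

# Materializing the view as a list allows positional access,
# but the result depends on the order of the keys.
third_value = list(record.values())[2]
print(indexing_failed, third_value)
```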

Since the items in data are JSON dicts, you probably want to look up attributes by name instead of relying on their index:

for item in data:
    star.append(item['thekeyyourelookingfor'])

Then make it a one-liner:

star = [item['thekeyyourelookingfor'] for item in data]
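
Put together, the loading step might look roughly like the sketch below. It assumes one JSON object per line, as in the question, and that the wanted keys are "stars" and "attributes" (both appear in the sample records). The two input lines and the business ids "a" and "b" are shortened, hypothetical stand-ins for yelp_data.json.

```python
import io
import json

# Two shortened sample lines standing in for yelp_data.json
# (assumption: one JSON object per line, as in the question).
sample = io.StringIO(
    '{"business_id": "a", "stars": 3.5, '
    '"attributes": {"By Appointment Only": true}}\n'
    '{"business_id": "b", "stars": 4.0, '
    '"attributes": {"Take-out": true, "Price Range": 1}}\n'
)

# Parse each line into a dict.
data = [json.loads(line) for line in sample]

# Look the fields up by name instead of by position.
star = [item["stars"] for item in data]
attributes = [item["attributes"] for item in data]
print(star)
```

With the real file, the io.StringIO object would be replaced by open('yelp_data.json').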

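Edit: Actually, because json.loads reads the JSON string into a dictionary, the order of the attributes is arbitrary, so when you access them by index you will most likely end up with a different attribute from the one you are looking for. Here, I guess you want to read stars. I would even guess that this is why your code fails: you are giving sklearn input that it does not expect.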
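
To illustrate why positional access is fragile, here is a minimal sketch with two hypothetical records that hold the same fields in a different order. It assumes Python 3.7 or later, where dicts keep insertion order and json.loads follows the order of the keys in the document. In older versions the order could differ from run to run, which only makes the problem worse.

```python
import json

# The same (hypothetical) record serialized with two different key orders.
first = json.loads('{"stars": 3.5, "city": "Phoenix"}')
second = json.loads('{"city": "Phoenix", "stars": 3.5}')

# Positional access returns a different field for each record...
by_position = [list(d.values())[0] for d in (first, second)]

# ...while access by key returns the rating in both cases.
by_key = [d["stars"] for d in (first, second)]
print(by_position, by_key)
```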