Reading a range of lines from the Yelp dataset with Python

Time: 2018-06-23 09:23:34

Tags: python indexing bigdata yelp

I want to change this code so that it reads only lines 1400001 through 1450000. What do I need to modify? The file consists of a single object type, one JSON object per line. I also want to save the output to a .csv file. How can I do that?

revu = []
with open("review.json", 'r', encoding="utf8") as f:
    for line in f:
        revu = json.loads(line[1400001:1450000])

1 Answer:

Answer 0 (score: 0)

If each line is JSON:

import json

revu = []
with open("review.json", 'r', encoding="utf8") as f:
    # expensive statement: readlines() loads the whole file into memory,
    # so depending on your file size this might run you out of memory
    revu = [json.loads(s) for s in f.readlines()[1400001:1450000]]

This is easy to test against the /etc/passwd file (no JSON there, of course, so that part is left out):

revu = []
with open("/etc/passwd", 'r') as f:
    # expensive statement
    revu = [s for s in f.readlines()[5:10]]

print(revu)  # gives entries 5 through 9 (the slice end is exclusive)

Or you iterate over the lines one by one, which avoids the memory problem:

revu = []
with open("...", 'r') as f:
    for i, line in enumerate(f):  # i is a 0-based line index
        if 1400001 <= i <= 1450000:
            revu.append(json.loads(line))

# process revu   
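
If it helps, the same line window can also be taken lazily with itertools.islice instead of checking every index by hand. This is only a minimal sketch, assuming the same review.json layout and the same 0-based line indices as above:

import json
from itertools import islice

# take the lines at 0-based indices 1400001..1450000 without reading
# the whole file into memory
with open("review.json", 'r', encoding="utf8") as f:
    revu = [json.loads(line) for line in islice(f, 1400001, 1450001)]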

To CSV ...

import pandas as pd
import json

def mylines(filename, _from, _to):
    # yield the parsed JSON object for every line whose 0-based index
    # lies between _from and _to (inclusive)
    with open(filename, encoding="utf8") as f:
        for i, line in enumerate(f):
            if _from <= i <= _to:
                yield json.loads(line)

# collect the selected records into a DataFrame and write it out as CSV
df = pd.DataFrame([r for r in mylines("review.json", 1400001, 1450000)])
df.to_csv("/tmp/whatever.csv")
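
If you would rather not pull in pandas just for the export, the standard csv module can do the same job. A minimal sketch, assuming every JSON object is flat and shares the same keys; the output filename is only an example:

import csv
import json
from itertools import islice

with open("review.json", encoding="utf8") as f, \
        open("reviews_subset.csv", "w", newline="", encoding="utf8") as out:  # hypothetical output path
    # stream the selected lines and parse each one as JSON
    records = (json.loads(line) for line in islice(f, 1400001, 1450001))
    first = next(records, None)
    if first is not None:
        # take the column names from the first record
        writer = csv.DictWriter(out, fieldnames=list(first))
        writer.writeheader()
        writer.writerow(first)
        writer.writerows(records)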