I have a hash file that looks like this, with one record per line:
Amy:0001:[{'name': 'Amy', 'age': '14', 'grade': '7', 'award': '0'}]
Carl:0024:[{'name': 'Carl', 'age': '12', 'grade': '6', 'award': '2'}, {'name': 'Carl', 'age': '18', 'grade': '12', 'award': '4'}, {'name': 'Carl', 'age': '13', 'grade': '6', 'award': '7'}]
More...
I want a DataFrame like this:
           name  age  grade  award
Amy:0001   Amy   14   7      0
Carl:0024  Carl  12   6      2
Carl:0024  Carl  18   12     4
Carl:0024  Carl  13   6      7
I tried stripping the hash file line by line:
lines = [line.rstrip('\n') for line in open("my_file.txt")]
Answer 0 (score: 2)
Start with an empty DataFrame:
import pandas as pd
df = pd.DataFrame(columns=['key','name','age','grade','award'])
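Read the hash file into the DataFrame line by line:
import json

# hash_path is the path to the hash file, e.g. "my_file.txt"
with open(hash_path, 'r') as f:
    for line in f:
        # the first two colon-separated fields form the key, e.g. "Amy:0001"
        key = ":".join(line.split(":", 2)[:2])
        rows = line.split(":", 2)[-1]
        # json requires double quotes for strings
        rows = json.loads(rows.replace("'", '"'))
        for row in rows:
            row['key'] = key
            # DataFrame.append was removed in pandas 2.0; add the row by label instead
            df.loc[len(df)] = pd.Series(row)

# set the 'key' column to the index
df.set_index('key', inplace=True)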
Answer 1 (score: 1)
Here is a solution using ast.literal_eval that does not require explicit line-by-line iteration. You should find it more efficient.
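from ast import literal_eval
from io import StringIO
from itertools import chain

import numpy as np
import pandas as pd

x = """Amy:0001:[{'name': 'Amy', 'age': '14', 'grade': '7', 'award': '0'}]
Carl:0024:[{'name': 'Carl', 'age': '12', 'grade': '6', 'award': '2'}, {'name': 'Carl', 'age': '18', 'grade': '12', 'award': '4'}, {'name': 'Carl', 'age': '13', 'grade': '6', 'award': '7'}]"""

# split each line at '[': the left part is the id, the right part the list
df = pd.read_csv(StringIO(x), delimiter='[', header=None, names=['id', 'data'])
df['id'] = df['id'].str[:-1]  # drop the trailing ':'
# restore the '[' consumed as the delimiter, then parse the list of dicts
df['data'] = df['data'].map(lambda x: literal_eval(f'[{x}'))
lens = df['data'].str.len()  # number of records on each line
# repeat each id once per record and join with the flattened records
df = pd.DataFrame({'id': np.repeat(df['id'].values, lens)})\
    .join(pd.DataFrame(list(chain.from_iterable(df['data']))))\
    .set_index('id')
# column order in the output may differ depending on the pandas version
print(df)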
          age award grade  name
id
Amy:0001   14     0     7   Amy
Carl:0024  12     2     6  Carl
Carl:0024  18     4    12  Carl
Carl:0024  13     7     6  Carl
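One practical reason to prefer ast.literal_eval over replacing quotes and calling json.loads: the payload is Python literal syntax, so a value containing an apostrophe breaks the quote replacement. A small sketch with a made-up record (the name "O'Brien" is hypothetical and not from the original data):

```python
import json
from ast import literal_eval

# Hypothetical payload: Python prints a string containing an apostrophe
# with double quotes, so the payload mixes both quote styles.
payload = """[{'name': "O'Brien", 'age': '15'}]"""

try:
    json.loads(payload.replace("'", '"'))
    json_ok = True
except json.JSONDecodeError:
    json_ok = False  # the replaced text is no longer valid JSON

# literal_eval parses Python literal syntax directly, quotes included.
rows = literal_eval(payload)
```

If the file is known to contain only simple values without quotes, both approaches give the same result.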