I am doing some web scraping and storing the variables of interest in the following form:
a = {'b':[100, 200],'c':[300, 400]}
This is one page, which has two values for b and two for c. The next page may have three of each, which I store as:
b = {'b':[300, 400, 500],'c':[500, 600, 700]}
When I create a DataFrame from the list of dicts, I get:
import pandas as pd
df = pd.DataFrame([a, b])
df
b c
0 [100, 200] [300, 400]
1 [300, 400, 500] [500, 600, 700]
What I expect is:
df
b c
0 100 300
1 200 400
2 300 500
3 400 600
4 500 700
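I could create a DataFrame every time I store a page and concat the list of DataFrames at the end. However, in my experience this is very expensive, because building thousands of DataFrames costs far more than creating a single DataFrame from a lower-level constructor (i.e. a list of dicts).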
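For reference, the per-page approach described above might look roughly like the following sketch. It reuses the a and b dicts from the question; the pages list is a name introduced here only for illustration.

```python
import pandas as pd

a = {'b': [100, 200], 'c': [300, 400]}
b = {'b': [300, 400, 500], 'c': [500, 600, 700]}

# Hypothetical list standing in for the pages scraped so far.
pages = [a, b]

# Build one small DataFrame per page, then concatenate them once at the end.
frames = [pd.DataFrame(page) for page in pages]
df = pd.concat(frames, ignore_index=True)
print(df)
```

This gives the desired shape, but it constructs one DataFrame per page, which is the cost the question wants to avoid.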
Answer 0 (score: 1)
Try this; the keys are renamed for clarity:
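import pandas as pd
a = {'e': [100, 200], 'f': [300, 400]}
b = {'e': [300, 400, 500], 'f': [500, 600, 700]}
c = {'e': [300, 400, 500], 'f': [500, 600, 700]}
listDicts = [a, b, c]
dd = {}
for x in listDicts:
    for k in listDicts[0].keys():
        # Append this dict's values to what has been collected for key k.
        try:
            dd[k] = dd[k] + x[k]
        except KeyError:
            # First time this key is seen.
            dd[k] = x[k]
df = pd.DataFrame(dd)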
e f
0 100 300
1 200 400
2 300 500
3 400 600
4 500 700
5 300 500
6 400 600
7 500 700
Answer 1 (score: 1)
Comprehensions FTW (maybe not the fastest, but can you get any more pythonic?):
import pandas as pd
list_of_dicts = [{'b': [100, 200], 'c': [300, 400]},
                 {'b': [300, 400, 500], 'c': [500, 600, 700]}]
def extract(key):
    # Flatten the lists stored under key across all dicts.
    return [item for x in list_of_dicts for item in x[key]]
df = pd.DataFrame({k: extract(k) for k in ['b', 'c']})
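The key list ['b', 'c'] is hard-coded above. A small variation, assuming every dict shares the keys of the first one, derives the keys from the data and flattens with itertools.chain.from_iterable. This is only a sketch of the same idea, not a faster method.

```python
from itertools import chain

import pandas as pd

list_of_dicts = [{'b': [100, 200], 'c': [300, 400]},
                 {'b': [300, 400, 500], 'c': [500, 600, 700]}]

# Assumption: every dict has the same keys as the first one.
keys = list(list_of_dicts[0])

# For each key, chain the per-page lists into one flat list.
df = pd.DataFrame({
    k: list(chain.from_iterable(d[k] for d in list_of_dicts))
    for k in keys
})
```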
Edit:
I stand corrected. It is as fast as some of the other approaches.
import pandas as pd
import toolz
list_of_dicts = [{'b': [100, 200], 'c': [300, 400]},
                 {'b': [300, 400, 500], 'c': [500, 600, 700]}]
def extract(key):
    return [item for x in list_of_dicts for item in x[key]]
def merge_dicts(trg, src):
    for k, v in src.items():
        trg[k].extend(v)
def approach_AlbertoGarciaRaboso():
    df = pd.DataFrame({k: extract(k) for k in ['b', 'c']})
def approach_root():
    df = pd.DataFrame(toolz.merge_with(lambda x: list(toolz.concat(x)), list_of_dicts))
def approach_Merlin():
    dd = {}
    for x in list_of_dicts:
        for k in list_of_dicts[0].keys():
            try: dd[k] = dd[k] + x[k]
            except KeyError: dd[k] = x[k]
    df = pd.DataFrame(dd)
def approach_MichaelHoff():
    # Note: this mutates list_of_dicts[0] in place, so the input keeps
    # growing across repeated timing runs.
    merge_dicts(list_of_dicts[0], list_of_dicts[1])
    df = pd.DataFrame(list_of_dicts[0])
%timeit approach_AlbertoGarciaRaboso() # 1000 loops, best of 3: 501 µs per loop
%timeit approach_root() # 1000 loops, best of 3: 503 µs per loop
%timeit approach_Merlin() # 1000 loops, best of 3: 516 µs per loop
%timeit approach_MichaelHoff() # 100 loops, best of 3: 2.62 ms per loop
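Outside IPython, a comparison of this kind can be sketched with the standard-library timeit module. The function names below are hypothetical, and the in-place variant works on a deep copy so that every run sees the same input; the copy itself adds some overhead, so the numbers are only indicative.

```python
import copy
import timeit

import pandas as pd

list_of_dicts = [{'b': [100, 200], 'c': [300, 400]},
                 {'b': [300, 400, 500], 'c': [500, 600, 700]}]

def by_comprehension(dicts):
    # Flatten each key's lists across all dicts.
    return pd.DataFrame({k: [item for d in dicts for item in d[k]]
                         for k in ['b', 'c']})

def by_in_place_merge(dicts):
    # Copy first so the caller's data is not modified between runs.
    dicts = copy.deepcopy(dicts)
    target = dicts[0]
    for src in dicts[1:]:
        for k, v in src.items():
            target[k].extend(v)
    return pd.DataFrame(target)

runs = 200
for func in (by_comprehension, by_in_place_merge):
    seconds = timeit.timeit(lambda: func(list_of_dicts), number=runs)
    print(f"{func.__name__}: {seconds / runs * 1e6:.0f} µs per call")
```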
Answer 2 (score: 0)
How about simply merging the dictionaries at each step?
import pandas as pd
def merge_dicts(trg, src):
    # Extend each list in trg with the matching list from src (modifies trg in place).
    for k, v in src.items():
        trg[k].extend(v)
a = {'b':[100, 200],'c':[300, 400]}
b = {'b':[300, 400, 500],'c':[500, 600, 700]}
merge_dicts(a, b)
print(a)
# {'b': [100, 200, 300, 400, 500], 'c': [300, 400, 500, 600, 700]}
print(pd.DataFrame(a))
# b c
# 0 100 300
# 1 200 400
# 2 300 500
# 3 400 600
# 4 500 700
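As written, merge_dicts assumes every key in src already exists in trg, so starting from an empty dict would raise a KeyError. A minimal sketch of the same idea for a scraping loop, using dict.setdefault so the accumulator can start empty; the pages list and the collected name are stand-ins introduced for illustration.

```python
import pandas as pd

def merge_dicts(trg, src):
    for k, v in src.items():
        # setdefault creates an empty list the first time a key is seen.
        trg.setdefault(k, []).extend(v)

# Hypothetical stand-in for the dicts produced page by page while scraping.
pages = [{'b': [100, 200], 'c': [300, 400]},
         {'b': [300, 400, 500], 'c': [500, 600, 700]}]

collected = {}
for page in pages:
    merge_dicts(collected, page)

# The DataFrame is built only once, after all pages are merged.
df = pd.DataFrame(collected)
```

Because the accumulator owns its lists, the per-page dicts are left unmodified.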