My data is stored in a database as key-value pairs. When I query it, I get back this "flattened" table, and I need to un-flatten it by essentially pivoting each key-value column pair. It is easier to see in the example tables below.
Unlike other related questions, I have a few different rules/constraints:
To make it less abstract: think of my data as the results of a battery of tests run on unique samples, where not every sample gets the same tests. Each row (entry_ID) is a unique sample. A key identifies one of the tests, and the value is the result of that test. Not every sample gets every test, and some samples get a test that is never completed, so the result is missing. For example, in the table below, sample A got tests P and Q, but sample B only got test N.
The tests change over time, but the database schema does not, so different tests and their results get uploaded into the same generically named columns. That means I cannot simply rename the columns (e.g., for KEY_1 / VALUE_1 I could rename VALUE_1 to "P" and be done, but that does not work for KEY_0 / VALUE_0, which holds more than one key). The example below is a small, simplified case.
A typical query returns at least 10k rows with at least 300 key-value column pairs holding more than 300 unique keys, and I need to turn that into a format that is more reasonable for analysis. The real keys are longer strings and the values are floats... hence my question. Thanks!
What I get (simplified):
i entry_ID KEY_0 VALUE_0 KEY_1 VALUE_1 KEY_2 VALUE_2 KEY_3 VALUE_3 KEY_4 VALUE_4
0 A None NaN P 183.0 Q 238.0 None NaN R NaN
1 B N 886.0 None NaN None NaN None NaN R NaN
2 C N 156.0 P 905.0 Q 566.0 None NaN R NaN
3 D N 843.0 P 396.0 None NaN None NaN R NaN
4 E None NaN None NaN Q 118.0 None NaN R NaN
5 F N 719.0 P 721.0 Q 526.0 None NaN R NaN
6 G N 894.0 P 136.0 Q 438.0 None NaN R NaN
7 H None NaN P 646.0 None NaN None NaN R NaN
8 I N 447.0 P 978.0 Q 458.0 None NaN R NaN
9 J None NaN None NaN Q 390.0 None NaN R NaN
10 K O 843.0 P 745.0 Q 107.0 None NaN R NaN
11 L O 882.0 None NaN None NaN None NaN R NaN
12 M O 382.0 P 876.0 Q 829.0 None NaN R NaN
What I need:
i entry_ID N O P Q
0 A NaN NaN 183.0 238.0
1 B 886.0 NaN NaN NaN
2 C 156.0 NaN 905.0 566.0
3 D 843.0 NaN 396.0 NaN
4 E NaN NaN NaN 118.0
5 F 719.0 NaN 721.0 526.0
6 G 894.0 NaN 136.0 438.0
7 H NaN NaN 646.0 NaN
8 I 447.0 NaN 978.0 458.0
9 J NaN NaN NaN 390.0
10 K NaN 843.0 745.0 107.0
11 L NaN 882.0 NaN NaN
12 M NaN 382.0 876.0 829.0
A reproducible example that creates a table like the first one above (requires Python 3 with pandas and numpy; tqdm is optional)...
import pandas, string, itertools, numpy, time, os
#from tqdm import tqdm
SOME_LETTERS = string.ascii_uppercase
N_KEYVAL_PAIRS = 100
SCALABLE = 3
entry_ID = [''.join(x) for x in list(itertools.permutations(SOME_LETTERS[:13], r=SCALABLE))] # for first 13 letters, n=154440 with r=5
source_keys = [''.join(x) for x in list(itertools.permutations(SOME_LETTERS[13:], r=SCALABLE))] # for the last 13 letters, n=154440 with r=5
dick = dict()
dick['entry_ID'] = entry_ID
value_col_names = ['VALUE_' + str(x) for x in range(N_KEYVAL_PAIRS)]
key_col_names = ['KEY_' + str(x) for x in range(N_KEYVAL_PAIRS)]
list_of_cols = ['entry_ID']
source_key_count = 0
#for keycol, valcol in zip(tqdm(key_col_names), value_col_names):
for keycol, valcol in zip(key_col_names, value_col_names):
    dummy_values = numpy.random.randint(1, high=1000, size=len(entry_ID), dtype='l')
    n_not_null = int(len(entry_ID) * 0.75) # about 25% data is null
    n_nulls = len(entry_ID) - n_not_null
    dum_vals = numpy.concatenate((numpy.full(n_nulls, numpy.nan), dummy_values[:n_not_null]))
    numpy.random.shuffle(dum_vals) # in place!
    dummy_keys = numpy.full(len(dum_vals), source_keys[source_key_count], dtype=object)
    if numpy.isnan(dum_vals[0]):
        source_key_count = source_key_count + 1
        dum_keys = numpy.concatenate((dummy_keys[:n_not_null],
                                      numpy.full(n_nulls, source_keys[source_key_count], dtype=object)))
        #print('yes')
    else:
        dum_keys = dummy_keys
    numpy.place(dum_keys, numpy.isnan(dum_vals), [None]) # in place!
    source_key_count = source_key_count + 1
    dick[keycol] = dum_keys
    dick[valcol] = dum_vals
    list_of_cols.append(keycol)
    list_of_cols.append(valcol)
## Add example of both empty
empty_val = numpy.full(len(entry_ID), numpy.nan)
empty_key = numpy.full(len(entry_ID), None, dtype=object)
empty_k_col = 'KEY_' + str(N_KEYVAL_PAIRS + 0)
empty_v_col = 'VALUE_' + str(N_KEYVAL_PAIRS + 0)
dick[empty_k_col] = empty_key
dick[empty_v_col] = empty_val
list_of_cols.append(empty_k_col)
list_of_cols.append(empty_v_col)
## Add example of empty val with key
emptyv_val = numpy.full(len(entry_ID), numpy.nan)
notempty_key = numpy.full(len(entry_ID), source_keys[source_key_count], dtype=object)
notempty_k_col = 'KEY_' + str(N_KEYVAL_PAIRS + 1)
emptyv_v_col = 'VALUE_' + str(N_KEYVAL_PAIRS + 1)
dick[notempty_k_col] = notempty_key
dick[emptyv_v_col] = emptyv_val
list_of_cols.append(notempty_k_col)
list_of_cols.append(emptyv_v_col)
my_data = pandas.DataFrame(dick)
my_data = my_data[list_of_cols]
#print(my_data.to_string()) # printing can take some time
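A quick sanity check of what the generator produces (with the defaults above this should be 13P3 = 1716 rows and 205 columns: entry_ID plus 102 key-value pairs)...
print(my_data.shape)
print(my_data.iloc[:3, :7]) # entry_ID plus the first three key-value pairs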
Here is my hacky attempt. It works (I think), but it takes a long time, especially as the table grows. I don't know what the big-O is, but it is bad: on my real data of ~20k rows and 300 key-value pairs it takes several minutes and eats a lot of RAM.
Run this snippet after the code above...
### PARSE KEY-VALUE PAIRS
# find KV pair columns
df_tgt = my_data # NB: a reference, not a copy; the in-place drops below also modify my_data
list_KEY_colnames = sorted(list(df_tgt.filter(regex='^KEY_[0-9]{1,3}$').columns))
list_VALUE_colnames = sorted(list(df_tgt.filter(regex='^VALUE_[0-9]{1,3}$').columns))
new_list_KEY_colnames = list_KEY_colnames
new_list_VALUE_colnames = list_VALUE_colnames
allkeys_withnan = pandas.unique(df_tgt[new_list_KEY_colnames].values.ravel()) # assume dupe names from multiple name cols will never be in the same row
allkeys = allkeys_withnan[pandas.notnull(allkeys_withnan)]
df_kv_parsed = pandas.DataFrame(index=df_tgt.index, columns=allkeys) # init
print(time.strftime("%H:%M:%S") + "\tSTARTING PIVOTING\t{}".format(str(os.getpid())))
##### START PIVOTING EACH PAIR ONE BY ONE UGH
#for each_key, each_value in zip(tqdm(new_list_KEY_colnames), new_list_VALUE_colnames):
for each_key, each_value in zip(new_list_KEY_colnames, new_list_VALUE_colnames):
    df_single_col_parsed = df_tgt.loc[:, [each_key, each_value]].dropna().pivot(columns=each_key, values=each_value)
    df_kv_parsed[df_single_col_parsed.columns.values] = df_single_col_parsed
print(time.strftime("%H:%M:%S") + "\tDONE PIVOTING\t{}".format(str(os.getpid())))
##### KILL ORIGINAL KV PAIRS
df_tgt.drop(list_KEY_colnames, axis=1, inplace=True)
df_tgt.drop(list_VALUE_colnames, axis=1, inplace=True)
##### MERGE WITH ORIGINAL AND THEN SAVE
df_fully_parsed = pandas.concat([df_tgt, df_kv_parsed], axis=1, ignore_index=False)
print(time.strftime("%H:%M:%S") + "\tDONE MERGING\t{}".format(str(os.getpid())))
## REMOVE NULL COLUMNS
df_fully_parsed.dropna(axis=1, how='all', inplace=True)
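For reference, a direction I have considered but not yet tested on the real data: melt all the KEY_/VALUE_ pairs at once with pandas.lreshape and then pivot a single time, instead of pivoting each pair one by one. This is only a sketch; it has to run on a fresh my_data (the snippet above drops the KEY/VALUE columns in place), it relies on the same assumption noted above that a key never repeats within a row, and lreshape's default dropna mirrors the per-pair dropna() in my loop...
key_cols = sorted(my_data.filter(regex='^KEY_[0-9]{1,3}$').columns)
val_cols = sorted(my_data.filter(regex='^VALUE_[0-9]{1,3}$').columns)
df_long = pandas.lreshape(my_data, {'KEY': key_cols, 'VALUE': val_cols}) # wide-to-long in one pass; drops rows where key or value is null
df_wide = df_long.pivot(index='entry_ID', columns='KEY', values='VALUE') # raises if a key repeats within a sample
df_wide = df_wide.reindex(my_data['entry_ID']).reset_index()             # keep samples that had no usable pairs at all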