Question

I have a list of strings, each 10 characters long.

final_list = ['ACTGCATGTC',
 'CAACACAACG',
 'TTCATGCCGA',
 'AGCCGTGTAT',
 'CAGTCACCAT',
 'TCGTACGTGC',
 'GAGATTGGTG',
 'GCATGTTCCA',
 ...]

Full file

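I want to select 384 of the 1389 total strings so that the characters A, C, G and T are represented as equally as possible at each position: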

from collections import Counter, defaultdict
import pandas as pd

balance_df = pd.DataFrame.from_records(final_list)

pos_dict = defaultdict()

for i in range(0, len(balance_df.columns)):
    pos_dict[i] = Counter(balance_df[i])

pd.DataFrame.from_dict(pos_dict)

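Ideally, each letter should appear 96 times at each position in the final list of 384. The counts per position across all 1389 strings are: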

     0    1    2    3    4    5    6    7    8    9
A  383  375  372  353  342  342  333  326  319  318
C  401  398  388  380  380  373  367  372  381  379
G  304  317  315  350  349  360  363  366  372  380
T  301  299  314  306  318  314  326  325  317  312

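I tried to do this by keeping track of the accepted strings, building a list of the two least-represented characters at each position, and only allowing strings made of those characters to be added in the next iteration: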

from heapq import nsmallest

compliance_dict = defaultdict(dict)
for s in range(0,10):
#set up dict
    compliance_dict[s]['A'] = 0
    compliance_dict[s]['T'] = 0
    compliance_dict[s]['G'] = 0
    compliance_dict[s]['C'] = 0


def acceptable_balance(counts, str_to_add):

    allowed = defaultdict(list)
    for s in range(0,10):
        ratio_dict = defaultdict()
        # use the counts passed in rather than the global compliance_dict
        total_row = sum(counts[s].values())
        if total_row == 0:
            allowed[s].extend(['A','T','C','G'])
        else:
            ratio_dict['A'] = counts[s].get('A')/total_row
            ratio_dict['T'] = counts[s].get('T')/total_row
            ratio_dict['G'] = counts[s].get('G')/total_row
            ratio_dict['C'] = counts[s].get('C')/total_row

        two_lowest = nsmallest(2,ratio_dict,key=lambda x: (ratio_dict.get(x),x))

        for al in two_lowest:
            allowed[s].append(al)

    reject = []
    for s in range(0,10):
        if str_to_add[s] in allowed[s]:
            reject.append(0)
        else:
            reject.append(1)

    if sum(reject) == 0:
        add = True
    else:
        add = False

    return add

def check_balance(count_dict, new_str):


    added = False

    if acceptable_balance(count_dict, new_str):
        for s in range(0,len(new_str)):

            #add count
            count_dict[s][new_str[s]] += 1

        added = True

    return added

Answer 1

First of all, there are 1.07e354 possible combinations, so brute-forcing them is impossible.
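As a quick sanity check on that magnitude, the number of ways to pick 384 strings out of 1389 can be computed exactly with the standard library. This is only an illustrative sketch; the variable names are mine, and `math.comb` assumes Python 3.8 or later.

```python
import math

total_strings = 1389   # size of the full list
subset_size = 384      # size of the desired selection

# Exact number of distinct 384-element subsets (a very large integer)
n_subsets = math.comb(total_strings, subset_size)

# The value is too large for a float, so format it from its decimal digits
digits = str(n_subsets)
print(f"about {digits[0]}.{digits[1:3]}e{len(digits) - 1} combinations")
```

Even at a billion subsets per second, enumerating a space of this size is out of the question.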

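Any algorithm that bases future decisions on the strings accepted so far can get stuck in a local extremum. For example, what if the next string meets your criteria, but rejecting it and waiting for the one after it would have led to a perfect solution? Once you accept the next string, the one after it might now be rejected. In the worst case, depending on the choices you have made so far, none of the remaining strings improves the balance and you never reach a solution.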

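Your approach is very inflexible, because you reject any string that does not have one of the two least-represented bases at every position. You may not even be able to reach a solution unless you use a very lax tolerance, e.g. accepting a string as long as half of its bases are among the two least represented at their positions. Even then, the solution would be far from optimal.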
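To make the "lax tolerance" idea concrete, here is a minimal sketch of such a relaxed acceptance test. The function names and the `min_fraction` parameter are hypothetical and not part of the question's code; it only illustrates the rule "accept if at least half of the positions carry one of the two least-represented bases".

```python
from collections import Counter

BASES = "ACGT"

def two_least_represented(counts):
    """Return the two bases with the lowest counts at one position."""
    return sorted(BASES, key=lambda b: (counts[b], b))[:2]

def relaxed_acceptable(position_counts, candidate, min_fraction=0.5):
    """Accept `candidate` if at least `min_fraction` of its positions
    carry one of the two least-represented bases at that position."""
    hits = sum(
        base in two_least_represented(position_counts[pos])
        for pos, base in enumerate(candidate)
    )
    return hits >= min_fraction * len(candidate)

# One Counter per position; A and C are over-represented everywhere,
# so G and T are the two least-represented bases at every position.
position_counts = [Counter({"A": 5, "C": 4, "G": 1, "T": 0}) for _ in range(10)]

accepted = relaxed_acceptable(position_counts, "GTGTGAAAAA")  # 5 of 10 hits
rejected = relaxed_acceptable(position_counts, "GTGTAAAAAA")  # 4 of 10 hits
print(accepted, rejected)
```

This still decides greedily, so it remains subject to the local-extremum problem described above.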

Solution

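I propose an iterative metric-minimisation approach. Pick any 384 strings and leave the rest in a "pool". For each string in the chosen list, try swapping it with every string in the pool and measure whether that improves your metric. If it does, make the switch. After going through all 384 strings, start the process again if your metric has improved; otherwise you have converged to a solution.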

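We can represent each string as a 4x10 table, like the one in the question, with 1s in the appropriate positions and 0s everywhere else. In fact, it is slightly more efficient to use a flat array of 40 elements, but the idea is the same. After summing all 384 such arrays, we get the equivalent of your pandas table. Since the mean is 96 and you want as many elements as possible to be as close to 96 as possible, the standard deviation (SD) is the perfect metric.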

import numpy as np

def decompose_strings(strings):
    decomposition = np.zeros((len(strings), 40,))
    strides = dict(zip('ATCG', range(4)))
    for i, string in enumerate(strings):
        for j, value in enumerate(string):
            decomposition[i,10 * strides[value] + j] = 1
    return decomposition

def minimise_variance(table, size):
    idx = list(np.random.choice(range(table.shape[0]), size, replace=False))
    chosen = idx
    pool = [i for i in range(table.shape[0]) if i not in idx]

    print('{0:>10s}{1:>10s}'.format('start', 'end'))
    print('-' * 20)
    std = table[chosen].sum(axis=0).std()
    while True:
        start_std = std
        for i, chosen_idx in enumerate(chosen):
            # for each `i`, the remaining `size` - 1 elements will sum up
            # to the same constant, so we should only calculate it once
            temp_sum = table[chosen].sum(axis=0) - table[chosen_idx]
            j_better = None
            for j, pool_idx in enumerate(pool):
                current_std = (temp_sum + table[pool_idx]).std()
                if current_std < std:
                    std = current_std
                    j_better = j
            if j_better is not None:
                chosen[i] = pool[j_better]
                pool[j_better] = chosen_idx
            else:
                chosen[i] = chosen_idx
        print('{0:10.6f}{1:10.6f}'.format(start_std, std))
        if start_std == std:
            break
    return chosen

And to run it

with open('final_list.txt') as f:
    data = f.read().split('\n')[:-1]
table = decompose_strings(data)

solution = minimise_variance(table, 384)
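To inspect a result in the same layout as the table in the question, the flat 40-element count vector can be split back into one row per base. The helper names below are hypothetical; the sketch assumes the same `ATCG` row order and the `10 * row + position` indexing that `decompose_strings` uses.

```python
BASES = "ATCG"   # same order as `strides` in decompose_strings
LENGTH = 10      # characters per string

def count_rows(flat_counts):
    """Split a flat 40-element count vector into {base: [count per position]}."""
    return {
        base: [int(c) for c in flat_counts[k * LENGTH:(k + 1) * LENGTH]]
        for k, base in enumerate(BASES)
    }

def print_count_table(flat_counts):
    """Print the counts as a bases x positions table."""
    print("   " + "".join(f"{pos:>4d}" for pos in range(LENGTH)))
    for base, row in count_rows(flat_counts).items():
        print(f"{base}  " + "".join(f"{c:>4d}" for c in row))

# Toy stand-in for a summed selection: two strings from the question
flat = [0] * (len(BASES) * LENGTH)
for s in ["ACTGCATGTC", "CAACACAACG"]:
    for j, base in enumerate(s):
        flat[LENGTH * BASES.index(base) + j] += 1

rows = count_rows(flat)
print_count_table(flat)
```

With the real data, passing `table[solution].sum(axis=0)` instead of `flat` should work the same way, since the slicing also applies to a NumPy array.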

On average, a solution converges in 4 iterations, with each iteration taking 15 seconds on my machine.

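In every solution, most of the table values are 96 and a few are 95 or 97. In fact, every 95 is paired with a 97 so that the mean stays at 96. This means the number of errors (table entries that differ from 96) is always even, and in this case we can even compute the SD as np.sqrt(errors / 40).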
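The relation between the error count and the SD can be checked on a small made-up table. This sketch uses `statistics.pstdev`, the population standard deviation, which matches the default behaviour of NumPy's `.std()`; the example counts are invented for illustration and assume every error is off by exactly one.

```python
import math
from statistics import pstdev

# A flat 40-element table where every entry is 96 ...
counts = [96] * 40
# ... except for two 97/95 pairs, i.e. 4 errors in total
counts[0], counts[1] = 97, 95
counts[2], counts[3] = 97, 95

errors = sum(c != 96 for c in counts)
sd = pstdev(counts)

# Each error contributes a squared deviation of 1, so variance = errors / 40
print(sd, math.sqrt(errors / 40))

# Inverting the formula recovers the error count from the SD
recovered_errors = round(40 * sd ** 2)
print(recovered_errors)
```

The inversion in the last step is the same one used to turn a converged SD back into an error count.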

I collected the results from 200 runs and plotted a histogram of the number of errors (inverting the formula above to compute it from the SD).

Edit

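We can do better if we chain the solutions. We call the function again and ask it to start from the previously returned solution, but we swap one element for a new one and then let it converge. Swapping in a random element does increase the SD, and the new solution may even end up with a higher SD than the previous one, but the SD generally seems to stay within the range of 10-14 errors. Not only that, the new function call will most likely converge in 2 iterations; one to find something new and one to confirm there is nothing better.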

# just change this
def minimise_variance(table, size):
    idx = list(np.random.choice(range(table.shape[0]), size, replace=False))

# to this
def minimise_variance(table, size, idx=None):
    if idx is None:
        idx = list(np.random.choice(range(table.shape[0]), size, replace=False))
    else:
        idx = list(idx)
        # By shuffling the indices we ensure there is no bias
        # in which element is rotated out and which ones are
        # considered first for improvement.
        np.random.shuffle(idx)
        while True:
            switch_idx = np.random.choice(range(table.shape[0]))
            if switch_idx not in idx:
                # if we were to switch out the first element, it's likely
                # the old solution could be found again
                idx[-1] = switch_idx
                break

Then run it like this

solutions = [minimise_variance(table, 384)]
for _ in range(1, 10):
    solutions.append(minimise_variance(table, 384, solutions[-1]))
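Because a later link in the chain can end up worse than an earlier one, it may be worth keeping the best solution seen rather than the last. The sketch below shows one way to pick it on toy data; `selection_sd` is a hypothetical pure-Python helper, not part of the code above, and the toy strings are made up.

```python
from statistics import pstdev

BASES = "ATCG"

def selection_sd(strings, selection):
    """Population SD of the per-position base counts of a selection."""
    length = len(strings[0])
    counts = [0] * (len(BASES) * length)
    for idx in selection:
        for j, base in enumerate(strings[idx]):
            counts[length * BASES.index(base) + j] += 1
    return pstdev(counts)

# Toy data: five 4-character strings, selections of size 4
strings = ["AAAA", "CCCC", "GGGG", "TTTT", "AAAA"]
solutions = [
    [0, 1, 2, 4],  # two A-strings, no T-string: unbalanced
    [0, 1, 2, 3],  # one string per base: perfectly balanced
    [0, 4, 2, 3],  # two A-strings, no C-string: unbalanced
]

best = min(solutions, key=lambda sel: selection_sd(strings, sel))
print(best, selection_sd(strings, best))
```

With the NumPy table from the answer, the equivalent selection would presumably be `min(solutions, key=lambda s: table[s].sum(axis=0).std())`.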

I wrote a C version of this code and collected 100k runs.

There were 22 solutions with 4 errors, and they were fairly distinct from one another.

The sorted indices of one of them are

[3, 11, 28, 121, 123, 125, 132, 263, 264, 272, 292, 307, 314, 319, 334, 341, 350, 355, 365, 366, 371, 388, 390, 399, 401, 404, 425, 434, 441, 449, 458, 459, 474, 475, 480, 484, 485, 486, 487, 488, 489, 490, 496, 498, 499, 500, 501, 502, 504, 505, 507, 508, 512, 516, 517, 518, 519, 523, 525, 530, 534, 535, 540, 541, 544, 546, 548, 549, 551, 552, 555, 557, 558, 559, 560, 562, 563, 564, 566, 567, 569, 570, 572, 573, 574, 575, 576, 577, 578, 579, 580, 581, 582, 583, 584, 586, 587, 589, 591, 593, 600, 611, 633, 643, 647, 655, 658, 659, 665, 667, 668, 669, 672, 674, 679, 680, 683, 686, 693, 697, 715, 718, 720, 723, 724, 725, 729, 732, 735, 736, 737, 741, 742, 749, 751, 753, 755, 758, 760, 764, 765, 766, 767, 771, 772, 773, 775, 779, 780, 782, 783, 786, 787, 789, 790, 791, 798, 801, 806, 807, 808, 810, 811, 814, 816, 817, 820, 822, 823, 825, 826, 827, 830, 831, 832, 834, 835, 836, 840, 843, 845, 846, 847, 849, 850, 853, 855, 858, 867, 871, 874, 884, 887, 889, 897, 900, 905, 912, 915, 918, 941, 946, 956, 958, 959, 966, 971, 975, 976, 980, 984, 986, 988, 990, 991, 996, 999, 1001, 1003, 1011, 1013, 1015, 1016, 1017, 1018, 1020, 1028, 1029, 1032, 1036, 1037, 1038, 1039, 1041, 1042, 1045, 1046, 1047, 1048, 1049, 1050, 1055, 1057, 1058, 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066, 1067, 1069, 1071, 1072, 1074, 1075, 1076, 1077, 1078, 1080, 1083, 1084, 1085, 1087, 1089, 1091, 1093, 1095, 1098, 1099, 1103, 1107, 1109, 1110, 1113, 1118, 1119, 1124, 1125, 1126, 1127, 1128, 1130, 1133, 1135, 1136, 1138, 1140, 1141, 1142, 1145, 1146, 1149, 1150, 1152, 1153, 1154, 1156, 1157, 1158, 1159, 1160, 1161, 1162, 1163, 1164, 1165, 1166, 1167, 1169, 1170, 1171, 1173, 1175, 1176, 1178, 1179, 1180, 1181, 1182, 1183, 1184, 1185, 1187, 1188, 1189, 1190, 1191, 1192, 1194, 1196, 1198, 1199, 1201, 1203, 1204, 1205, 1206, 1207, 1208, 1209, 1210, 1211, 1212, 1213, 1214, 1217, 1218, 1220, 1221, 1222, 1223, 1224, 1225, 1226, 1227, 1230, 1231, 1233, 1234, 1235, 1236, 1240, 1241, 1242, 1243, 1246, 1247, 
1250, 1255, 1257, 1258, 1259, 1260, 1262, 1265, 1266, 1267, 1268, 1276, 1279, 1321]

and its pandas table

    0   1   2   3   4   5   6   7   8   9
A  96  96  96  96  96  96  96  96  96  96
C  97  96  96  96  96  96  96  96  96  96
G  96  96  96  96  96  97  96  96  96  96
T  95  96  96  96  96  95  96  96  96  96
