Question

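I have a dictionary (call it dict) whose keys are strings representing feature names and whose values are floats representing the count of each feature.

Here is an example of my dictionary (dict):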

{'11268-238-1028': 2.0, '1028': 10.0, '10295': 2.0, '1781': 2.0, '11268-238': 3.0, '6967-167': 1.0, '9742-232-788': 1.0, '8542': 4.0, '238-1028': 5.0, '1028-122': 1.0}

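In this example, '10295' is considered a degree-one feature, '6967-167' a degree-two feature, and '9742-232-788' a degree-three feature. If we had 'x-x-x-x-x-x-x', it would be a degree-seven feature. In other words, any degree-n feature contains (n-1) dashes ('-').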
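The degree rule above can be written as a one-line helper. This is only a small sketch; the name `feature_degree` is made up for illustration and is not part of the original question.

```python
def feature_degree(feature: str) -> int:
    """Return the degree of a dash-separated feature name.

    A degree-n feature contains (n - 1) dashes, so the degree is
    the dash count plus one.
    """
    return feature.count("-") + 1


print(feature_degree("10295"))         # 1
print(feature_degree("6967-167"))      # 2
print(feature_degree("9742-232-788"))  # 3
```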

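'11268-238-1028': 2.0 means that the degree-three feature '11268-238-1028' has a count of 2. Then we see '11268-238': 3.0, meaning that '11268-238' occurred 3 times. However, there is a double-counting problem here, because 2 of the 3 occurrences of '11268-238' are actually due to occurrences of '11268-238-1028'. We therefore want to change the count of '11268-238' to its actual count, which is 3 - 2 = 1.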

Similarly, the actual count of '238-1028' is not 5, because '238-1028' is part of '11268-238-1028', whose count is 2. So the actual count of '238-1028' should be 5 - 2 = 3.

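Another example is the feature '1028', whose actual count should not be 10. '1028' is part of the degree-three feature '11268-238-1028', whose count is 2. '1028' is also part of the degree-two feature '238-1028', whose count is 5. And '1028' is part of the degree-two feature '1028-122', whose count is 1. So the actual count of the degree-one feature '1028' should be 10 - 2 - 5 - 1 = 2.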
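To make the arithmetic in these three examples concrete, here is the same calculation done by hand against the sample dictionary. The variable names `counts` and `actual` are only for illustration; the numbers are the ones from the question.

```python
counts = {
    '11268-238-1028': 2.0, '1028': 10.0, '10295': 2.0, '1781': 2.0,
    '11268-238': 3.0, '6967-167': 1.0, '9742-232-788': 1.0,
    '8542': 4.0, '238-1028': 5.0, '1028-122': 1.0,
}

# Subtract, from each smaller feature, the original counts of the
# larger features that contain it (as described in the question).
actual = {
    '11268-238': counts['11268-238'] - counts['11268-238-1028'],
    '238-1028': counts['238-1028'] - counts['11268-238-1028'],
    '1028': (counts['1028']
             - counts['11268-238-1028']
             - counts['238-1028']
             - counts['1028-122']),
}

print(actual)  # {'11268-238': 1.0, '238-1028': 3.0, '1028': 2.0}
```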

Which algorithm should I use to solve this problem?

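I have considered converting each key into a set of degree-one features by splitting on the dash, and then, for each set, running a subset-membership test against every other set of greater length. However, a set stores unordered elements, and I care about order. For example, the feature '11268-238-1028' converted to a set would be (['11268', '238', '1028']); another feature, '11268-1028', converted to a set would be (['11268', '1028']). If I ran a subset test on these two sets, I would conclude that (['11268', '1028']) is a subset of (['11268', '238', '1028']). However, the feature '11268-1028' is not a sub-feature of '11268-238-1028', because there is another '238' between '11268' and '1028'; that is, order should matter.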
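The difference between the set test and an order-aware test can be seen in a few lines. The sketch below is one possible way to do the ordered check, not necessarily the best one: it pads both names with dashes so that a plain substring search only matches whole, adjacent components. The function name `is_ordered_subfeature` is made up for this example.

```python
def is_ordered_subfeature(small: str, big: str) -> bool:
    """True if `small` appears inside `big` as a run of adjacent components.

    Padding both strings with '-' prevents partial matches such as
    '1028' matching inside '10295'.
    """
    return small != big and f"-{small}-" in f"-{big}-"


big = '11268-238-1028'

# The unordered set test wrongly accepts '11268-1028'.
print(set('11268-1028'.split('-')) <= set(big.split('-')))  # True

# The ordered test rejects it, but accepts genuine sub-features.
print(is_ordered_subfeature('11268-1028', big))  # False
print(is_ordered_subfeature('238-1028', big))    # True
print(is_ordered_subfeature('1028', '10295'))    # False
```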

So how can I solve this problem?

Thanks a lot!

Answer 1

Break the problem up into smaller, simpler problems.

First, let's write a helper function that actually adjusts the data dictionary:

# this assumes we have one big feature (e.g. degree 3) and several smaller features (e.g. degree 2 and 1)
def adjust_data(big_feature, smaller_features, data):
    for feature in smaller_features:
        if feature.count("-") == big_feature.count("-"):
            continue  # skip any features that are the same size as our target
        # three cases for a sub-feature: the big feature starts with it,
        # ends with it, or contains it in the middle
        # we use delimiters to eliminate partial matches
        does_start = big_feature.startswith(feature + "-")
        does_end = big_feature.endswith("-" + feature)
        does_contain = "-" + feature + "-" in big_feature
        if does_start or does_end or does_contain:
            # one of our cases matches, so this is a sub-feature of our big feature
            data[feature] -= data[big_feature]

Before using it, we need to organize our data so that it is sorted appropriately.

sorted_keys = sorted(my_data_dict.keys(),
                     key=lambda key: key.count("-"),
                     reverse=True)  # we want bigger features on top

Now just walk through the sorted keys:

for i, key in enumerate(sorted_keys, 1):
    adjust_data(key, sorted_keys[i:], my_data_dict)

This is just brute force, so it won't be especially fast, but it will get the job done.
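One thing to be aware of: the loop above subtracts each larger feature's current count, which may already have been reduced by an earlier pass. If I trace it correctly on the sample data, '1028' ends up at 10 - 2 - 3 - 1 = 4 rather than the 2 worked out in the question, because '238-1028' has already dropped from 5 to 3 by the time it is used. Which convention is wanted depends on how the counts were collected. If you want exactly the question's arithmetic, a possible variant is to subtract the original counts instead. The sketch below does that with token lists; the function names are made up for illustration and it is still brute force.

```python
def is_contiguous_run(small, big):
    """True if the list `small` appears in the list `big` as adjacent items."""
    n = len(small)
    return any(big[i:i + n] == small for i in range(len(big) - n + 1))


def undo_double_counting_from_originals(counts):
    """Return a new dict; `counts` itself is left unchanged.

    Each feature loses the ORIGINAL count of every strictly longer
    feature that contains it as a run of adjacent components.
    """
    tokens = {key: key.split("-") for key in counts}
    adjusted = dict(counts)
    for small, small_tokens in tokens.items():
        for big, big_tokens in tokens.items():
            if (len(big_tokens) > len(small_tokens)
                    and is_contiguous_run(small_tokens, big_tokens)):
                adjusted[small] -= counts[big]
    return adjusted


my_data_dict = {
    '11268-238-1028': 2.0, '1028': 10.0, '10295': 2.0, '1781': 2.0,
    '11268-238': 3.0, '6967-167': 1.0, '9742-232-788': 1.0,
    '8542': 4.0, '238-1028': 5.0, '1028-122': 1.0,
}
result = undo_double_counting_from_originals(my_data_dict)
print(result['11268-238'], result['238-1028'], result['1028'])  # 1.0 3.0 2.0
```

Splitting every key once up front avoids repeating the string matching for each pair, but the work is still quadratic in the number of keys.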

Answer 2

Preventing the double counting when the dict is first created would be easier than undoing it afterwards.

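But suppose the dict cannot be recreated. Here is a solution. It does not assume that every higher-degree feature is guaranteed to have a lower-degree counterpart at every degree (i.e., for a feature A1-A2-...-An, you may be missing any of A1, A1-A2, etc., up to A1-A2-...-An-1). If that assumption actually holds, some of the try-except handling can be simplified.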

def undo_double_counting(d):
    sorted_features = sorted(d, key=lambda f: f.count('-'), reverse=True)
    for f in sorted_features:
        if '-' not in f:
            return d
        feature_below, _ = f.rsplit('-', 1)
        while True:
            try:
                d[feature_below] -= d[f]
            except KeyError:
                # if the feature one degree below isn't actually in d,
                # we keep trying lower degrees until we know that we
                # can't go lower any more (by hitting ValueError)
                try:
                    feature_below, _ = feature_below.rsplit('-', 1)
                except ValueError:
                    break
            else:
                break
    # if there are no degree-1 features in d, return here
    return d

Trying it on your data (by the way, why floats rather than ints?):

{'1028': 9.0,
 '1028-122': 1.0,
 '10295': 2.0,
 '11268-238': 1.0,
 '11268-238-1028': 2.0,
 '1781': 2.0,
 '238-1028': 5.0,
 '6967-167': 1.0,
 '8542': 4.0,
 '9742-232-788': 1.0}

How do I perform the following ordered subset test in Python?
