Question

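I have the AWS cpu-utilization data from NAB (the Numenta Anomaly Benchmark) and I am using it to build anomaly detection with the AWS SageMaker Random Cut Forest. I can run it, but I need a deeper understanding of hyperparameter tuning. I have read the AWS documentation, but I still need to understand how the hyperparameters are chosen. Are they an educated guess, or do we need to compute the mean and standard deviation of co_disp to infer them?

Thanks.

I have tried 100 trees with a tree_size of 512 and 256 to detect anomalies, but how do I infer these parameters?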

    import rrcf

    # Set tree parameters
    num_trees = 50
    shingle_size = 48
    tree_size = 512

    # Create a forest of empty trees
    forest = [rrcf.RCTree() for _ in range(num_trees)]

    # Use the "shingle" generator to create a rolling window
    # temp_data holds my aws_cpuutilization values
    points = rrcf.shingle(temp_data, size=shingle_size)

    # Dict to store the anomaly score (average CoDisp) of each point
    avg_codisp = {}

    # For each shingle...
    for index, point in enumerate(points):
        # For each tree in the forest...
        for tree in forest:
            # If the tree is above the permitted size, drop the oldest point (FIFO)
            if len(tree.leaves) > tree_size:
                tree.forget_point(index - tree_size)
            # Insert the new point into the tree
            tree.insert_point(point, index=index)
            # Compute CoDisp on the new point and average it over all trees
            if index not in avg_codisp:
                avg_codisp[index] = 0
            avg_codisp[index] += tree.codisp(index) / num_trees

    # Collect the scores in insertion order
    values = list(avg_codisp.values())

Answer 1

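Thanks for your interest in RandomCutForest. If you have labeled anomalies, we recommend using SageMaker Automatic Model Tuning (https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html) and letting SageMaker find the combination that works best.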
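As a rough sketch of what that tuning setup could look like with the SageMaker Python SDK: the helper name `build_rcf_tuner` and the job counts are made up for illustration, the `estimator` argument is assumed to be an RCF estimator you have already configured, and the ranges and the `test:f1` objective are what I recall from the RCF tuning documentation, so please verify them against the current docs.

```python
# Tunable ranges as listed in the SageMaker RCF tuning docs at the time of
# writing (assumption: verify against the current documentation).
RCF_TUNABLE_RANGES = {
    "num_trees": (50, 1000),
    "num_samples_per_tree": (1, 2048),
}


def build_rcf_tuner(estimator, max_jobs=20, max_parallel_jobs=2):
    """Wrap an already-configured RCF estimator in a tuning job definition.

    Hypothetical helper; max_jobs and max_parallel_jobs are placeholder values.
    """
    # Imported lazily so the ranges above can be inspected without the SDK.
    from sagemaker.tuner import HyperparameterTuner, IntegerParameter

    ranges = {
        name: IntegerParameter(low, high)
        for name, (low, high) in RCF_TUNABLE_RANGES.items()
    }
    return HyperparameterTuner(
        estimator=estimator,
        objective_metric_name="test:f1",  # F1 score on the labeled test channel
        objective_type="Maximize",
        hyperparameter_ranges=ranges,
        max_jobs=max_jobs,
        max_parallel_jobs=max_parallel_jobs,
    )
```

You would then call `fit` on the returned tuner with your training data plus a labeled test channel; the labels are what make the F1 objective computable.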

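As a heuristic, if you know that your data contains, say, 0.4% anomalies, you would set the number of samples per tree to N = 1 / (0.4 / 100) = 250. The idea is that each tree represents a sample of your data, and every data point in the tree is treated as "normal". If your trees hold too few points, e.g. 10, then most points will look different from those "normal" points, i.e. they will get a high anomaly score.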
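The arithmetic behind that rule of thumb, as a minimal sketch. The function name `samples_per_tree` is made up for illustration, and the mapping to `num_samples_per_tree` in SageMaker (or `tree_size` in the rrcf snippet from the question) is my reading of the heuristic rather than a documented rule.

```python
def samples_per_tree(anomaly_percent: float) -> int:
    """Rule of thumb: N = 1 / (anomaly_percent / 100), rounded to an integer."""
    if not 0 < anomaly_percent <= 100:
        raise ValueError("anomaly_percent must be in (0, 100]")
    return round(100 / anomaly_percent)


# 0.4% anomalies gives 250 samples per tree, as in the example above
print(samples_per_tree(0.4))
# A rarer anomaly rate calls for larger trees
print(samples_per_tree(0.1))
```

Treat the result as a starting point for tuning, not a final value.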

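The relationship between the number of trees and the underlying data is more complex. As the range of your "normal" points grows, you will want more trees.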
