Numpy: How to convert observations to probabilities?

Time: 2017-03-30 12:46:42

Tags: python numpy

I have a feature matrix and the corresponding targets, each of which is either one or zero:

# raw observations
features = np.array([[1, 1, 0],
                     [1, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 0, 1]])

targets = np.array([1, 0, 1, 1, 0, 0])

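As you can see, the same feature row may correspond to both ones and zeros. I need to convert my raw observation matrix into a probability matrix, where each unique feature row is mapped to the probability of observing a one as its target: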

[1 1 0] -> 0.5
[0 1 0] -> 0.67
[0 0 1] -> 0

I have built a very straightforward solution:

import numpy as np

# raw observations
features = np.array([[1, 1, 0],
                     [1, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 0, 1]])

targets = np.array([1, 0, 1, 1, 0, 0])

from collections import Counter

def convert_obs_to_proba(features, targets):
    features_ = []
    targets_ = []

    # compute unique rows (idx will point to some representative)
    b = np.ascontiguousarray(features).view(np.dtype((np.void, features.dtype.itemsize * features.shape[1])))
    _, idx = np.unique(b, return_index=True)

    idx = idx[::-1]

    zeros = Counter()
    ones = Counter()

    # collect row-wise number of one and zero targets
    for i, row in enumerate(features[:]):        
        if targets[i] == 0:
            zeros[tuple(row)] += 1
        else:
            ones[tuple(row)] += 1

    # iterate over unique features and compute probabilities
    for k in idx:
        unique_row = features[k]

        zero_count = zeros[tuple(unique_row)]
        one_count = ones[tuple(unique_row)]

        proba = float(one_count) / float(zero_count + one_count)

        features_.append(unique_row)
        targets_.append(proba)

    return np.array(features_), np.array(targets_)

features_, targets_ = convert_obs_to_proba(features, targets)

print(features_)
print(targets_)

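where I:

  • extract the unique feature rows;
  • count the number of zero and one targets observed for each unique feature row;
  • compute the probabilities and construct the result.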
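The three steps above can also be sketched without an explicit Python loop. This is only a minimal sketch, assuming a NumPy version in which `np.unique` accepts the `axis` argument (1.13 or later); the variable names `unique_rows`, `inverse`, `totals`, `ones` and `probabilities` are illustrative and not part of the original code.

```python
import numpy as np

# Sample data from the question
features = np.array([[1, 1, 0],
                     [1, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 0, 1]])
targets = np.array([1, 0, 1, 1, 0, 0])

# Step 1: unique rows, plus for every original row the index of its unique row.
unique_rows, inverse = np.unique(features, axis=0, return_inverse=True)
# Flatten defensively, in case a NumPy version returns a non-flat inverse.
inverse = inverse.ravel()

# Step 2: per unique row, count all observations and those with target 1.
totals = np.bincount(inverse, minlength=len(unique_rows))
ones = np.bincount(inverse, weights=targets, minlength=len(unique_rows))

# Step 3: fraction of observations whose target is 1.
probabilities = ones / totals

print(unique_rows)
print(probabilities)
```

Note that `np.unique` returns the rows in sorted order, so here the result is ordered [0 0 1], [0 1 0], [1 1 0] rather than in the order produced by `idx[::-1]` in the function above.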

Can this be solved in a prettier way using some advanced numpy magic?

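Update. The previous code was quite inefficient, O(n^2), so I converted it to the more performance-friendly version shown above. Old code: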

import numpy as np

# raw observations
features = np.array([[1, 1, 0],
                     [1, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 0, 1]])

targets = np.array([1, 0, 1, 1, 0, 0])

def convert_obs_to_proba(features, targets):
    features_ = []
    targets_ = []

    # compute unique rows (idx will point to some representative)
    b = np.ascontiguousarray(features).view(np.dtype((np.void, features.dtype.itemsize * features.shape[1])))
    _, idx = np.unique(b, return_index=True)

    idx = idx[::-1]

    # calculate ZERO class occurrences and ONE class occurrences
    for k in idx:
        unique_row = features[k]

        zeros = 0
        ones = 0

        for i, row in enumerate(features[:]):        
            if np.array_equal(row, unique_row):            
                if targets[i] == 0:
                    zeros += 1
                else:
                    ones += 1

        proba = float(ones) / float(zeros + ones)

        features_.append(unique_row)
        targets_.append(proba)

    return np.array(features_), np.array(targets_)

features_, targets_ = convert_obs_to_proba(features, targets)

print(features_)
print(targets_)

2 answers:

Answer 0: (score: 5)

It's easy using Pandas:

import pandas as pd

df = pd.DataFrame(features)
df['targets'] = targets

Now you have:

   0  1  2  targets
0  1  1  0        1
1  1  1  0        0
2  0  1  0        1
3  0  1  0        1
4  0  1  0        0
5  0  0  1        0

Now, the fancy part:

df.groupby([0,1,2]).targets.mean()

Which gives you:

0  1  2
0  0  1    0.000000
   1  0    0.666667
1  1  0    0.500000
Name: targets, dtype: float64

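Pandas does not print the 0 in the leftmost column of the 0.666667 row because repeated index labels are hidden in the display, but if you inspect the index, the value there really is 0.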
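To make the hidden index value visible, one option is to turn the group keys back into ordinary columns with `reset_index()`. The following is a minimal sketch that rebuilds the question's sample data so it runs on its own; the name `result` is illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question
features = np.array([[1, 1, 0],
                     [1, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 0, 1]])
targets = np.array([1, 0, 1, 1, 0, 0])

df = pd.DataFrame(features)
df['targets'] = targets

# Group by the three feature columns and average the targets, then move
# the group keys from the MultiIndex back into regular columns so that
# every value is printed on every row.
result = df.groupby([0, 1, 2])['targets'].mean().reset_index()

print(result)
```

If plain NumPy arrays are needed afterwards, `result[[0, 1, 2]].values` and `result['targets'].values` give the unique rows and their probabilities.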

Answer 1: (score: 0)

np.sum(np.reshape([targets[f] if tuple(features[f])==tuple(i) else 0 for i in np.vstack(sorted(set(map(tuple,features)))) for f in range(features.shape[0])],(-1,features.shape[0])),axis=1)/np.sum(np.reshape([1 if tuple(features[f])==tuple(i) else 0 for i in np.vstack(sorted(set(map(tuple,features)))) for f in range(features.shape[0])],(-1,features.shape[0])),axis=1)

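There you go, numpy magic! It isn't strictly necessary, but this could probably be cleaned up with a few boring variables ;) (This is probably far from optimal.)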
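As a rough idea of what the version with "boring variables" might look like, here is a minimal sketch that keeps the same approach of comparing every unique row against every observation, but uses broadcasting in place of the nested comprehensions. It assumes NumPy 1.13 or later for `np.unique(..., axis=0)`; the names `matches`, `ones_per_row` and `counts_per_row` are illustrative and not from the answer.

```python
import numpy as np

# Sample data from the question
features = np.array([[1, 1, 0],
                     [1, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0],
                     [0, 0, 1]])
targets = np.array([1, 0, 1, 1, 0, 0])

# Unique feature rows, in sorted order.
unique_rows = np.unique(features, axis=0)

# matches[i, j] is True when observation j equals unique row i.
matches = (unique_rows[:, None, :] == features[None, :, :]).all(axis=2)

# Sum of the targets of the matching observations, per unique row.
ones_per_row = (matches * targets).sum(axis=1)
# Number of matching observations, per unique row.
counts_per_row = matches.sum(axis=1)

probabilities = ones_per_row / counts_per_row

print(unique_rows)
print(probabilities)
```

Like the one-liner, this does work proportional to the number of unique rows times the number of observations, so it is readable rather than fast.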