Question

I am trying to reproduce the results of this paper: https://arxiv.org/pdf/1607.06520.pdf

Specifically, this part:

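To identify the gender subspace, we took the ten gender pair difference vectors and computed its principal components (PCs). As Figure 6 shows, there is a single direction that explains the majority of variance in these vectors. The first eigenvalue is significantly larger than the rest.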

I am using the same set of word vectors as the authors (Google News corpus, 300 dimensions), which I load with gensim's word2vec loader.

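The "ten gender pair difference vectors" the authors refer to are computed from ten word pairs (listed in the `pairs` variable in the code below).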

I computed the difference between the normalized vectors of each pair as follows:

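import gensim
import numpy as np

# Load the pretrained Google News vectors (gensim 3.x API).
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
model.init_sims()  # precompute the L2-normalized vectors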

pairs = [('she', 'he'),
('her', 'his'),
('woman', 'man'),
('Mary', 'John'),
('herself', 'himself'),
('daughter', 'son'),
('mother', 'father'),
('gal', 'guy'),
('girl', 'boy'),
('female', 'male')]

difference_matrix = np.array([model.word_vec(a[0], use_norm=True) - model.word_vec(a[1], use_norm=True) for a in pairs])

Then I ran PCA on the resulting matrix with 10 components, as in the paper:

from sklearn.decomposition import PCA
pca = PCA(n_components=10)
pca.fit(difference_matrix)

However, when I look at pca.explained_variance_ratio_, I get very different results:

array([  2.83391436e-01,   2.48616155e-01,   1.90642492e-01,
         9.98411858e-02,   5.61260498e-02,   5.29706681e-02,
         2.75670634e-02,   2.21957722e-02,   1.86491774e-02,
         1.99108478e-32])

The first component accounts for less than 30% of the variance, when it should be over 60%!

These results are similar to what I get when I run PCA on randomly selected vectors, so I must be doing something wrong, but I can't figure out what.

Note: I have also tried without normalizing the vectors, but I get the same results.

Answer 1

They released the code for the paper on GitHub: https://github.com/tolga-b/debiaswe

Specifically, one of the files in that repository contains the code they use to create the PCA plot.

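Based on the code, it looks like they take the difference between each word in a pair and that pair's mean vector. It is not clear to me that this is what the paper describes. However, I ran their code with their pairs and was able to reproduce the figure from the paper.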
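A minimal sketch of the construction described above, not the repository's actual code: for each pair it stacks each word's vector minus the pair's mean. The function name `centered_pair_matrix`, the `get_vector` callable, and the toy 3-dimensional vectors are all hypothetical and exist only for illustration.

```python
import numpy as np

def centered_pair_matrix(pairs, get_vector):
    """Stack a - c and b - c for every pair (a, b), where c = (a + b) / 2.

    `get_vector` is a hypothetical callable mapping a word to a 1-D array.
    """
    rows = []
    for a, b in pairs:
        va, vb = get_vector(a), get_vector(b)
        center = (va + vb) / 2
        rows.append(va - center)
        rows.append(vb - center)
    return np.array(rows)

# Toy 3-dimensional vectors, made up purely for illustration.
toy = {
    'she': np.array([1.0, 0.2, 0.0]),
    'he': np.array([-1.0, 0.2, 0.0]),
    'woman': np.array([0.9, 0.0, 0.5]),
    'man': np.array([-1.1, 0.0, 0.5]),
}
matrix = centered_pair_matrix([('she', 'he'), ('woman', 'man')], toy.__getitem__)
print(matrix.shape)  # (4, 3): two rows per pair
```

With the model from the question, this could presumably be called as `centered_pair_matrix(pairs, lambda w: model.word_vec(w, use_norm=True))` and the result passed to `pca.fit`. Whether the repository uses normalized or raw vectors is not something this sketch settles.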

Answer 2

Expanding on oregano's answer:

For each pair a and b, they compute the center c = (a + b)/2 and then include vectors pointing in both directions: a - c and b - c.

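This matters because PCA gives you the directions along which the data varies the most. All of your vectors point in the same direction, so there is almost no variance in exactly the direction you are trying to reveal.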
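To spell out the algebra behind this point (the notation is mine, not the paper's): write d_i = a_i - b_i for the i-th of n pairs. A PCA implementation that centers its input, as scikit-learn's `PCA` does, diagonalizes a different matrix in each case:

```latex
% Difference vectors d_i only: the mean \bar{d} is subtracted first,
% so whatever all d_i share is removed before the variance is measured.
C_{\text{diff}} \propto \sum_{i=1}^{n} (d_i - \bar{d})(d_i - \bar{d})^{\top},
\qquad \bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i

% Centered pairs: a_i - c_i = +d_i/2 and b_i - c_i = -d_i/2, so the 2n rows
% have mean exactly zero and the shared direction is kept.
C_{\text{pairs}} \propto \sum_{i=1}^{n} \left[ \tfrac{d_i}{2}\tfrac{d_i^{\top}}{2}
  + \left(-\tfrac{d_i}{2}\right)\left(-\tfrac{d_i^{\top}}{2}\right) \right]
= \frac{1}{2} \sum_{i=1}^{n} d_i d_i^{\top}
```

If every d_i is close to one common vector, C_diff only sees the residuals around it, while the top eigenvector of C_pairs is close to that common vector.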

Their set contains vectors pointing in both directions within the gender subspace, so PCA clearly reveals the gender variation.
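The effect can be illustrated without the Google News vectors. The sketch below is an assumption-laden toy: it uses synthetic random vectors rather than real embeddings, a made-up shared direction `g`, and a small NumPy helper (`explained_variance_ratio`, a hypothetical name) that computes the same quantity as a mean-centered PCA via SVD.

```python
import numpy as np

def explained_variance_ratio(X):
    """Explained-variance ratios of a mean-centered PCA, computed via SVD."""
    Xc = X - X.mean(axis=0)
    singular_values = np.linalg.svd(Xc, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

rng = np.random.default_rng(0)
dim, n_pairs = 50, 10

# Synthetic stand-ins for word vectors (assumption: not real embeddings).
g = np.zeros(dim)
g[0] = 1.0  # the direction shared by every pair
base = rng.normal(size=(n_pairs, dim))  # content common to both words of a pair
a = base + 0.5 * g + rng.normal(scale=0.05, size=(n_pairs, dim))
b = base - 0.5 * g + rng.normal(scale=0.05, size=(n_pairs, dim))

# Approach in the question: one difference vector per pair.
diff_ratio = explained_variance_ratio(a - b)

# Approach in the answers: both words relative to the pair's center.
center = (a + b) / 2
centered_ratio = explained_variance_ratio(np.vstack([a - center, b - center]))

print("differences, first PC:", round(diff_ratio[0], 3))
print("centered pairs, first PC:", round(centered_ratio[0], 3))
```

In this toy setup the first component dominates only for the centered pairs. The last ratio of the difference-vector PCA is also numerically zero, because centering ten rows leaves a matrix of rank at most nine, which is consistent with the 1.99e-32 entry in the question's output.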

PCA on word2vec embeddings
