Question

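I am trying to build a model that uses word embeddings. When I load the embedding vectors, I get an error when running the session.

I have seen many posts about the same error, but none of them helped me. My code is as follows: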

# Build vocabulary
max_document_length = max([len(x.split(" ")) for x in x_text])
if (not use_glove):
    print ("Not using GloVe")
    vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
    x = np.array(list(vocab_processor.fit_transform(x_text)))
else:
    print ("Using GloVe")
    embedding_dim = 50
    filename = 'glove.twitter.27B.50d.txt'
    def loadGloVe(filename):
        vocab = []
        embd = []
        file = open(filename,'r')
        for line in file.readlines():
            row = line.strip().split(' ')
            vocab.append(row[0])
            embd.append(row[1:])
        print('Loaded GloVe!')
        file.close()
        return vocab,embd
    vocab,embd = loadGloVe(filename)
    vocab_size = len(vocab)
    embedding_dim = len(embd[0])
    embedding = np.asarray(embd)

    W = tf.Variable(tf.constant(0.0, shape=[vocab_size, embedding_dim]),
                    trainable=False, name="W")
    embedding_placeholder = tf.placeholder(tf.float32, [vocab_size, embedding_dim])
    embedding_init = W.assign(embedding_placeholder)
    # embedding_init = np.vstack([np.expand_dims(x, 0) for x in embedding_init])

    session_conf = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
    sess = tf.Session(config=session_conf)
    sess.run(embedding_init, feed_dict={embedding_placeholder: embedding})

The error I get is as follows:

>> python train.py
Loading data...
Using GloVe
Loaded GloVe!
Traceback (most recent call last):
  File "train.py", line 88, in <module>
    embedding = np.asarray(embd, dtype=float)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/numpy/core/numeric.py", line 538, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.

When I print embedding, I get the following:

[list(['0.78704', '0.72151', '0.29148', '-0.056527', '0.31683', '0.47172', '0.023461', '0.69568', '0.20782', '0.60985', '-0.22386', '0.7481', '-2.6208', '0.20117', '-0.48104', '0.12897', '0.035239', '-0.24486', '-0.36088', '0.026686', '0.28978', '-0.10698', '-0.34621', '0.021053', '0.54514', '-1.0958', '-0.274', '0.2233', '1.0827', '-0.029018', '-0.84029', '0.58619', '-0.36511', '0.34016', '0.89615', '0.32757', '0.24267', '0.68404', '-0.34374', '0.13583', '-2.2162', '-0.42537', '0.46157', '0.88626', '-0.22014', '0.025599', '-0.38615', '0.080107', '-0.075323', '-0.61461'])
 list(['0.68661', '-1.0772', '0.011114', '-0.24075', '-0.3422', '0.64456', '0.54957', '0.30411', '-0.54682', '1.4695', '0.43648', '-0.34223', '-2.7189', '0.46021', '0.016881', '0.13953', '0.020913', '0.050963', '-0.48108', '-1.0764', '-0.16807', '-0.014315', '-0.55055', '0.67823', '0.24359', '-1.3179', '-0.036348', '-0.228', '1.0337', '-0.53221', '-0.52934', '0.35537', '-0.44911', '0.79506', '0.56947', '0.071642', '-0.27455', '-0.056911', '-0.42961', '-0.64412', '-1.3495', '0.23258', '0.25383', '-0.10226', '0.65824', '0.16015', '0.20959', '-0.067516', '-0.51952', '-0.34922'])
 list(['0.98483', '0.19784', '0.28403', '0.35406', '0.2438', '0.42519', '-0.050784', '0.48965', '0.18231', '0.45225', '0.60871', '0.1023', '-2.246', '0.47362', '-0.20073', '-0.21838', '-0.58847', '0.23933', '0.47089', '-0.96444', '-0.06588', '-0.26914', '-0.58221', '-0.26283', '0.67984', '-0.87678', '-0.091667', '0.18128', '1.0218', '0.23728', '-1.0547', '0.19766', '-0.86072', '0.6021', '0.69374', '0.32242', '-0.074545', '0.38367', '0.28661', '-0.41465', '-2.882', '-0.30393', '0.047981', '1.0937', '0.4184', '-0.68958', '-0.45923', '0.23368', '-0.30628', '-0.093607'])
 ...
 list(['0.84287', '0.36278', '-1.7695', '1.0011', '-0.035064', '0.51417', '-1.5918', '0.85464', '1.0441', '-0.19218', '0.91523', '1.2206', '0.6551', '-0.48092', '0.89536', '-0.51738', '-0.113', '-0.14132', '0.69741', '-0.094937', '-0.046912', '-0.2098', '-0.029853', '0.49541', '0.66782', '0.23435', '1.6776', '0.13993', '1.2205', '0.11827', '0.4398', '-0.37945', '0.26414', '0.63263', '-0.48117', '-0.95508', '-0.39435', '-2.8466', '-0.64169', '0.61715', '3.0288', '1.2714', '-2.1379', '-0.11995', '-1.5553', '-0.17096', '-0.30855', '-0.24573', '0.63324', '-0.80304'])
 list(['0.82853', '-1.4966', '-0.33163', '-1.7248', '0.75364', '-0.66916', '0.21631', '0.54184', '-0.18342', '0.4248', '0.21309', '0.21076', '0.60751', '-0.31577', '0.5663', '0.10905', '0.12388', '-1.0154', '0.32227', '-0.92746', '-0.59573', '-0.8008', '1.146', '1.1625', '0.32181', '0.30272', '0.99954', '-1.4012', '0.076173', '-0.081811', '1.7618', '1.0314', '1.2658', '1.3319', '0.52592', '-0.30999', '-1.4563', '-1.4165', '0.21875', '0.36172', '2.7735', '0.20257', '0.074379', '-0.020002', '-1.0133', '0.56882', '-0.17648', '0.3729', '0.76953', '1.4394'])
 list(['-2.3613', '-0.94632', '-1.8524', '1.545', '0.29188', '0.21677', '0.090334', '-1.4557', '0.80716', '-0.88994', '-1.1031', '0.002139', '1.211', '-0.069074', '1.1984', '0.93501', '1.0359', '-0.17041', '0.44013', '-1.7879', '0.61577', '0.52878', '0.32978', '-0.82872', '0.48385', '0.76497', '-0.64303', '0.18897', '0.3698', '0.62647', '1.7118', '-0.2942', '-0.26316', '-0.35169', '-0.72771', '-0.71678', '0.91815', '-0.56122', '0.51562', '-0.030861', '-0.017585', '-0.58224', '-0.98393', '0.85906', '-0.67031', '0.34382', '-0.41876', '-0.40575', '-0.53006', '-0.20514'])]

How can I fix this? It seems like it should be simple, but I have spent a lot of time trying to fix it and cannot figure it out.

Answer 1

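This happens when you try to create a NumPy array from a list whose inner lists have different lengths. I ran into this problem while reading a GloVe file the same way you do, by manually splitting each line on the space character.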
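As a minimal illustration, NumPy raises this kind of error whenever the inner lists differ in length. The rows below are made up for the example and are not taken from the GloVe file; the exact wording of the message may differ between NumPy versions.

```python
import numpy as np

# Rows of equal length convert cleanly, even though the values are strings.
good_rows = [['0.1', '0.2', '0.3'], ['0.4', '0.5', '0.6']]
good = np.asarray(good_rows, dtype=float)
print(good.shape)

# One row is a value short, like a line whose word was lost during parsing.
ragged_rows = [['0.1', '0.2', '0.3'], ['0.4', '0.5']]
try:
    np.asarray(ragged_rows, dtype=float)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```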

If our copies of glove.twitter.27B.50d.txt are the same file, line 38523 contains

0.065581 0.39605 -0.96669 0.23706 -0.41379 -0.97006 0.16601 -1.292 -0.58989 0.11632 -1.365 -0.27939 -0.57222 -0.97108 -0.56319 -0.015263 -0.70465 -0.13867 1.0702 -0.25557 0.25122 -0.87755 0.70999 0.9118 -0.30077

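The word on that line appears to me to be a non-printable character. As a result, the code reads the first embedding value as the word, and that particular line ends up with one value too few (49 dimensions in your case).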

The exact same line can also be found in glove.twitter.27B.25d.txt, glove.twitter.27B.100d.txt, and glove.twitter.27B.200d.txt.
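One way to check your own copy of the file is to count the values on each line. The sketch below is only illustrative: find_short_rows is a hypothetical helper, and the sample lines are invented, with a plain space standing in for the problematic word.

```python
def find_short_rows(lines, embedding_dim):
    """Return (line_number, value_count) for rows with the wrong number of values."""
    bad = []
    for line_number, line in enumerate(lines, start=1):
        # Same parsing as in the question: strip, then split on single spaces.
        row = line.strip().split(' ')
        value_count = len(row) - 1  # the first field is assumed to be the word
        if value_count != embedding_dim:
            bad.append((line_number, value_count))
    return bad

# Invented sample: on the second line the "word" is a space, which strip() removes.
sample = ["the 0.1 0.2 0.3\n", "  0.4 0.5 0.6\n", "cat 0.7 0.8 0.9\n"]
bad_rows = find_short_rows(sample, embedding_dim=3)
print(bad_rows)
```

To scan the real file, pass the open file object in place of sample, together with the embedding_dim you expect.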

A quick and dirty solution that works is:

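for line in file.readlines():
    row = line.strip().split(' ')
    # A row with fewer than embedding_dim values has lost its word,
    # so put an empty placeholder word back at the front.
    if len(row)-1 < embedding_dim:
        row.insert(0, '')
    vocab.append(row[0])
    embd.append(row[1:])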
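If you would rather keep the original word than replace it with an empty string, an alternative sketch is to split each line from the right, so that the last embedding_dim fields are always the values and whatever remains is the word. parse_glove_lines is a hypothetical helper, not part of GloVe or NumPy, and it assumes the fields are separated by single spaces, as in the question's code.

```python
import numpy as np

def parse_glove_lines(lines, embedding_dim):
    """Parse 'word v1 v2 ... vN' lines into a vocab list and a float32 matrix."""
    vocab, embd = [], []
    for line in lines:
        # Remove only the newline so a whitespace-like word is not stripped away.
        parts = line.rstrip('\n').rsplit(' ', embedding_dim)
        if len(parts) != embedding_dim + 1:
            continue  # skip rows that still have the wrong number of fields
        vocab.append(parts[0])
        # Convert to float here so the array holds numbers, not strings.
        embd.append([float(value) for value in parts[1:]])
    return vocab, np.asarray(embd, dtype=np.float32)

# Invented sample: on the second line the "word" is a single space.
sample = ["the 0.1 0.2 0.3\n", "  0.4 0.5 0.6\n", "cat 0.7 0.8 0.9\n"]
vocab, embedding = parse_glove_lines(sample, embedding_dim=3)
print(vocab, embedding.shape, embedding.dtype)
```

Because every row is converted to floats and has the same length, the result is a regular 2-D array that can be fed to embedding_placeholder.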

ValueError: setting an array element with a sequence. in session.run
