Extracting word embeddings

Time: 2017-05-02 13:41:39

Tags: python python-2.7 vector ipython-notebook vsm

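I am using Python 2.7 and I have a file of pre-trained English embeddings. I need to look up the embedding of a given word in this file.

The vectors have 300 dimensions, and the file is formatted as follows:

  the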

-0.0279698616277 -0.00822567637943 -0.066859518431 0.0152934683231 -0.0329719520937 0.0530985715151 0.0346279291928 -0.0342044668875 0.000898163363809 -0.0358478199459 0.0330627337979 -0.0291780565785 0.0226246942919 -0.050316270082 -0.0999551118641 -0.0211768282161 -0.0650169654368 0.0136621823624 -0.13170513108 0.00761099698762 -0.0747038745232 -0.0309831087459 -0.0281774157081 -0.0381752846197 0.000854164869137 0.118230081556 -0.0544820178539 -0.0259578123228 -0.0250848970404 0.0432551614539 0.0604299831315 0.0605994794422 -0.0652365866148 0.0741619690129 -0.0122427203782 -0.0486630776978 0.0266766400501 -0.0575422338293 -0.0120115890454 0.067022888369 0.0563923322428 0.116347799963 0.0272241149902 -0.0271056717851 -0.0876134412848 -0.0160824708647 0.0478176382685 -0.0278610721008 -0.043103116023 -0.123507487497 -0.0286480325182 -0.00985009337681 -0.00749645238334 -0.00322952663845 -0.046423238718 0.103032221776 0.0821490881533 -0.121380150997 -0.00599957532621 -0.08430111579 14 -0.0667407039306 0.0204320098169 -0.0953102074899 -0.0644943672828 -0.00133722007224 0.00249399062204 -0.0199877549741 -0.0494372284268 0.00730022281006 0.100155611334 0.0158984940368 0.0919811737074 -0.0762293413195 0.0495974423547 0.110083862374 -0.0737607844265 0.0507363907294 -0.0101547411817 0.01065877457 0.0437805443228 0.0801814086384 -0.0739505163318 0.0359545673486 -0.0289695742598 0.122458949531 0.0247212132806 -0.0799729263198 -0.0204555870693 -0.00530952298573 -0.0580316010527 0.0849861556452 -0.0386267797212 0.0264685290268 -0.0680456213105 0.0826555349612 -0.0264161763876 0.0344213033507 -0.0995871582083 0.0533503097378 0.037602190303 -0.061794122114 -0.00452664681682 -0.025897662482 -0.0804463278447 -0.0725472056937 0.0121977936453 -0.109343313871

I tried using .split(" "), but that splits the vector as well. Any ideas on how to search for a word and extract its vector from the file?

3 answers:

Answer 0: (score: 1)

This code will parse the whole file and build a dict of the embedding vectors:

f.xreadlines()

Notes:

  • The format is assumed to be very strict: no blank lines, etc.
  • Works on Python 2. If you want to use Python 3, just replace f.xreadlines() with f.
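
A minimal sketch of the dict-building idea described in this answer, not the answerer's original code. It assumes the file alternates strictly between a word line and a line of space-separated floats, with no blank lines. The name `load_embeddings` and the three-dimensional sample data are hypothetical stand-ins for the real 300-dimensional file.

```python
import io


def load_embeddings(lines):
    """Build {word: [float, ...]} from alternating word / vector lines."""
    embeddings = {}
    it = iter(lines)
    for word_line in it:
        # Assumed strict format: every word line is followed by its vector line.
        vector_line = next(it)
        embeddings[word_line.strip()] = [float(x) for x in vector_line.split()]
    return embeddings


# Tiny in-memory stand-in for the real file (3 dimensions instead of 300).
sample = io.StringIO(u"the\n-0.0279 -0.0082 0.0152\ncat\n0.1 -0.2 0.3\n")
embeddings = load_embeddings(sample)
print(embeddings["the"])
```

With a real file, an open file object can be passed in place of `sample`, since iterating a file yields its lines in both Python 2 and Python 3. A lookup is then just `embeddings[word]`.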

Answer 1: (score: 0)

I noticed that each dimension takes 15 bytes, or 16 bytes if it starts with '-'. So I suggest using re.

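import re
# One vector line from the file (shortened sample); avoid naming it `str`, which shadows the builtin.
line = "-0.0279698616277 -0.00822567637943 0.0152934683231"
# Escape the dot and allow any number of digits, since the sample values vary in length.
res = re.findall(r'-?[0-9]+\.[0-9]+', line)
print(res)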

You can give it a try. I don't have the data, so it is harder for me to test. My suggestion may not help!

Answer 2: (score: 0)

How about:

line = "the -0.0279698616277 -0.00822567637943 -0.0668... etc"
word, vector = line.split(None,1)
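
Building on the split above, a minimal sketch of searching for one word and converting its vector to floats. It assumes each entry sits on a single line, with the word first and the values after it. The name `find_vector` and the sample data are hypothetical.

```python
import io


def find_vector(lines, target):
    """Return the vector for `target` as a list of floats, or None if absent."""
    for line in lines:
        # maxsplit=1 separates the word from the rest without splitting the vector.
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] == target:
            return [float(x) for x in parts[1].split()]
    return None


# Tiny in-memory stand-in for the real file (3 dimensions instead of 300).
sample = io.StringIO(u"the -0.0279 -0.0082 0.0152\ncat 0.1 -0.2 0.3\n")
vec = find_vector(sample, "cat")
print(vec)
```

Because the function returns at the first match, it avoids loading every vector into memory when only one word is needed.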