Specifying the number of hidden units in Facebook fastText

Time: 2017-05-22 14:33:56

Tags: facebook nlp text-classification fasttext

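In the paper on fastText for supervised classification, the authors vary some parameters to specify different numbers of hidden units (h is the one on pages 3 and 4; in Table 1 you see "It has 10 hidden units and we evaluate it with and without bigrams."). But after reading the documentation, there does not seem to be a "hidden units" parameter to change. Is there a way to specify the number of hidden units? Or is this the same as specifying the -dim option?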

1 answer:

Answer 0 (score: 0)

k is the number of classes.

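From Section 2.1 of https://arxiv.org/pdf/1607.01759v3.pdf:

"More precisely, the computational complexity is O(kh) where k is the number of classes and h the dimension of the text representation."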
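On the h part of the question: as far as I understand the model described in the paper, the text representation is the average of the word vectors, so its dimension h should be what the -dim option sets. The following is only a sketch under that assumption. The binary location, the training file name and the output name model_h10 are placeholders, and the block skips training if the binary or the file is missing.

```shell
# Assumed locations: the fastText binary and a training file in __label__ format.
FASTTEXT=./fasttext
TRAIN=cooking.stackexchange.txt
DIM=10   # matches the "10 hidden units" setting quoted from Table 1

if [ -x "$FASTTEXT" ] && [ -f "$TRAIN" ]; then
    # -dim sets the word-vector size; -wordNgrams 2 adds bigram features.
    "$FASTTEXT" supervised -input "$TRAIN" -output model_h10 -dim "$DIM" -wordNgrams 2
else
    echo "fastText binary or training file not found; skipping training" >&2
fi
```

Dropping -wordNgrams 2 would give the "without bigrams" variant of the same setting.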

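When predicting classes in text classification, from the docs:

"The argument k is optional, and equal to 1 by default. In order to obtain the k most likely labels for a piece of text, use:"

$ ./fasttext predict model.bin test.txt k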

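When training the model, the number of classes is specified implicitly by the training data: in supervised training it is given by the __label__* tags that appear there.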

From the example tutorial:

$ wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/cooking.stackexchange.tar.gz && tar xvzf cooking.stackexchange.tar.gz
--2017-05-23 09:03:26--  https://s3-us-west-1.amazonaws.com/fasttext-vectors/cooking.stackexchange.tar.gz
Resolving s3-us-west-1.amazonaws.com... 54.231.236.45
Connecting to s3-us-west-1.amazonaws.com|54.231.236.45|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 457609 (447K) [application/x-gzip]
Saving to: ‘cooking.stackexchange.tar.gz.1’

cooking.stackexchange.tar.gz.1      100%[================================================================>] 446.88K   385KB/s    in 1.2s    

2017-05-23 09:03:28 (385 KB/s) - ‘cooking.stackexchange.tar.gz.1’ saved [457609/457609]

x cooking.stackexchange.id
x cooking.stackexchange.txt
x readme.txt


$ cat readme.txt 
The data in this archive is derived from the user-contributed content on the
Cooking Stack Exchange website (https://cooking.stackexchange.com/), used under
CC-BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0/).

The original data dump can be downloaded from:
https://archive.org/download/stackexchange/cooking.stackexchange.com.7z
and details about the dump obtained from:
https://archive.org/details/stackexchange

We distribute two files, under CC-BY-SA 3.0:

 - cooking.stackexchange.txt, which contains all question titles and
   their associated tags (one question per line, tags are prefixed by
   the string "__label__") ;

 - cooking.stackexchange.id, which contains the corresponding row IDs,
   from the original data dump.
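
Since the tags are prefixed by "__label__", the number of classes can be read off a training file by counting the distinct tags. A minimal sketch, using a tiny made-up sample rather than the real cooking.stackexchange.txt (the three lines and the variable names below are purely illustrative):

```shell
# Hypothetical three-line sample in the fastText supervised format.
sample=$(mktemp)
cat > "$sample" <<'EOF'
__label__baking __label__bread how long should dough rise
__label__sauce how to thicken a tomato sauce
__label__baking why did my cake sink in the middle
EOF

# Pull out every __label__ token, de-duplicate, and count the distinct ones.
num_classes=$(grep -o '__label__[^ ]*' "$sample" | sort -u | wc -l | tr -d ' ')
echo "k = $num_classes"   # prints "k = 3" (baking, bread, sauce)

rm -f "$sample"
```

Pointing the same pipeline at cooking.stackexchange.txt instead of the sample should give the number of classes for the tutorial data.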