这是我尝试用于训练词嵌入的语料库的一个片段。
news_subent_12402 news_dlsub_00322 news_dlsub_00001 news_sub_00035 news_subent_07737 news_sub_00038 news_dlsub_00925 news_subent_07934 news_sub_00057 news_dlsub_01826 news_dlsub_00437 news_sub_00037 news_sub_00050 news_dlsub_00205 news_sub_00270 news_subent_05735 news_dlsub_00143 news_subent_12439 news_sub_00051 news_subent_08446 news_dlsub_00091 news_sub_00222 news_dlsub_00009 news_dlsub_00126 news_subent_15202 news_dlsub_00019 news_sub_00076 news_dlsub_00059 news_subent_11158 news_subent_10981 news_dlsub_00634 news_dlsub_00018 news_subent_03496 news_subent_16059 news_subent_08005 news_dlsub_00020 news_subent_15460 news_dlsub_00908 news_subent_12712 news_sub_00258 news_sub_00048 news_dlsub_00022 news_dlsub_00206 news_dlsub_00106 news_sub_00248 news_sub_00047 news_subent_02476 news_subent_14554 news_dlsub_00134 news_sub_00070 news_subent_06676 news_dlsub_00306 news_subent_11635 news_dlsub_01137 news_sub_00081 news_dlsub_00024 news_dlsub_00242 news_dlsub_00920 news_dlsub_00198 news_subent_02562 news_subent_09358 news_dlsub_00101 news_subent_02696 news_subent_17124 news_sub_00244 news_dlsub_00045 news_sub_00049 news_dlsub_00575 news_dlsub_00163 news_subent_03497 news_subent_10972 news_subent_05406 news_sub_00039 news_subent_14976 news_subent_20148 news_subent_02955 news_sub_00245 news_subent_02399 news_dlsub_00669 news_subent_12423 news_dlsub_00180 news_dlsub_00013 news_dlsub_00075 news_sub_00264 news_dlsub_01833 news_sub_00040 news_sub_00257 news_dlsub_00021 news_subent_14967 news_subent_03495 news_dlsub_00035 news_subent_21377 news_sub_00059 news_dlsub_01260 news_sub_00232 news_dlsub_00316 news_dlsub_00014 news_dlsub_00023 news_dlsub_00046 news_subent_02007 news_dlsub_00458 news_dlsub_00269 news_subent_04653 news_subent_06231 news_dlsub_01751 news_dlsub_00186 news_dlsub_00043 news_dlsub_00128 news_subent_05276 news_sub_00259 news_dlsub_00102 news_sub_00268 news_dlsub_00185 news_sub_00041 news_subent_09122 news_dlsub_00116 news_subent_09210 news_subent_07733 news_subent_06393 news_dlsub_00244 news_dlsub_00622 news_sub_00226 news_sub_00043 news_dlsub_00067
news_subent_03827 news_dlsub_00065 news_sub_00251 news_dlsub_01826 news_subent_17688 news_subent_07649 news_subent_02941 news_dlsub_00100 news_subent_08198 news_subent_02990 news_dlsub_00033 news_subent_02562 news_dlsub_00043 news_dlsub_00024 news_dlsub_00015 news_subent_07628 news_subent_07045 news_dlsub_00234 news_subent_09178 news_dlsub_00458 news_subent_02923 news_sub_00226 news_dlsub_00120 news_sub_00247 news_dlsub_00014 news_dlsub_01830 news_subent_02946 news_dlsub_00086 news_dlsub_00046 news_dlsub_00038 news_subent_16554 news_subent_03073 news_dlsub_00128 news_dlsub_00098 news_subent_02905 news_subent_09117 news_dlsub_00021 news_dlsub_00143 news_subent_03054 news_dlsub_00126 news_subent_16372 news_dlsub_01833 news_subent_03495 news_sub_00245 news_dlsub_00101 news_sub_00258 news_subent_11431 news_sub_00148 news_subent_09320 news_sub_00232 news_subent_02460 news_dlsub_00032 news_dlsub_00067 news_dlsub_00064 news_dlsub_00045 news_dlsub_00116 news_subent_11663 news_subent_03501 news_subent_02030 news_dlsub_00035 news_dlsub_00476 news_dlsub_00039 news_subent_14505 news_dlsub_00091 news_sub_00244 news_sub_00268 news_dlsub_00130 news_subent_02007 news_subent_03014 news_dlsub_00022 news_dlsub_00019 news_subent_09358 news_dlsub_00270 news_subent_17124 news_dlsub_00071 news_sub_00266 news_subent_06429 news_subent_02621 news_sub_00248
news_subent_03497 news_subent_03495 news_dlsub_01326 news_sub_00151 news_sub_00070 news_dlsub_00143 news_dlsub_00012 news_dlsub_00212 news_subent_04653 news_subent_02022 news_dlsub_00101 football_club_187 news_subent_02902 news_dlsub_00116 news_dlsub_00925 news_sub_00137 news_dlsub_00120 news_sub_00036 news_subent_02889 news_subent_14976 news_dlsub_00269 news_dlsub_00687 news_subent_15202 news_dlsub_00669 news_dlsub_00126 news_sub_00248 news_dlsub_00437 news_sub_00071 news_dlsub_00177 news_dlsub_00694 news_dlsub_00618 news_sub_00051 news_sub_00043 news_subent_14997 news_subent_02411 news_subent_16059 news_sub_00245 news_subent_02923 news_dlsub_00035 news_sub_00069 news_subent_05320 news_sub_00082 news_sub_00259 news_dlsub_01035 news_dlsub_00413 news_sub_00072 news_dlsub_00020 news_sub_00052 news_dlsub_00023 news_subent_03496 news_subent_02893 news_subent_16508 news_sub_00065 news_sub_00047 news_subent_05740 news_subent_13389 news_sub_00055 news_subent_09439 news_subent_02991 news_sub_00268 news_dlsub_00003 news_subent_04609 news_subent_03509 news_subent_04069 news_dlsub_00128 news_dlsub_00099 news_dlsub_00206 news_dlsub_00582 news_sub_00037 news_dlsub_00021 news_sub_00247 news_dlsub_01179 news_sub_00057 news_dlsub_00046 news_sub_00039 news_sub_00050 news_subent_03014 news_sub_00042 news_dlsub_01826 news_sub_00038 news_dlsub_00410 news_subent_12422 news_sub_00048 news_subent_13648 news_dlsub_01807 news_subent_20148 news_sub_00084 news_sub_00049 news_dlsub_00029 news_subent_11392 news_dlsub_00412 news_sub_00246 news_sub_00244 news_subent_16385 news_dlsub_00634 news_subent_13536 news_subent_03073 news_sub_00226 news_subent_11478 news_sub_00035 news_subent_14967 football_club_192 news_sub_00232 news_sub_00054 news_subent_06587 news_dlsub_00014 news_subent_02399 news_dlsub_00013 news_dlsub_00102 news_sub_00040 news_subent_01990 news_dlsub_00007 news_subent_07675 news_subent_07719 news_sub_00041 news_subent_04655 news_dlsub_00300 news_dlsub_00019 news_subent_07756 news_dlsub_00234 news_sub_00076
尽管每一行都是一个句子,而news_dlsub_00001
只是一个完整的单词。我不希望快速文本构造子词嵌入,而我想要的只是诸如news_dlsub_01326
news_subent_12402
等完整词的嵌入。
我的语料库中有15354个不同的单词,总共约有1000万行(句子)。
这是训练脚本:
./fasttext skipgram -input user_profile_tags_rows.txt -output model_user_tags -lr 0.01 -epoch 50 -wordNgrams 1 -bucket 200000 -dim 128 -loss hs -thread 80 -ws 5 -minCount 1
那么我如何设置训练脚本来禁用子词的嵌入表示训练以提高效率?谢谢。
答案 0 :(得分:1)
如果您想训练没有子词信息的词嵌入,可以将-maxn
参数设置为0。这意味着您仅使用最大长度为0的字符ngram,即不使用任何字符ngram。
答案 1 :(得分:0)
将两个选项都设置为零:-maxn 0 -minn 0