Sklearn CountVectorizer vocabulary is incomplete

Time: 2016-10-20 14:23:00

Tags: regex dictionary scikit-learn countvectorizer

Consider the following example:

from sklearn.feature_extraction.text import CountVectorizer

tf_vectorizer = CountVectorizer(max_df=1, min_df=0,
                                max_features=None,
                                stop_words=None)

all_docs = ['ETH:0x0000 00:17:A4:77:9C:04 09:00:2B:00:00:05 0 PortA Unknown 755 0 45300 ETH FirstHourDay_21 LastHourDay_23 duration_6911 ThreatScore_nan ThreatCategory_nan False Anomaly_False',
            'ETH:0x0000 00:17:A4:77:9C:04 09:00:2B:00:00:05 2 PortC Unknown 774 0 46440 ETH FirstHourDay_21 LastHourDay_23 duration_6911 ThreatScore_nan ThreatCategory_nan False Anomaly_False',
 'ETH:0x0000 00:17:A4:77:9C:0A 09:00:2B:00:00:05 0 PortA Unknown 752 0 45120 ETH FirstHourDay_21 LastHourDay_23 duration_6913 ThreatScore_nan ThreatCategory_nan False Anomaly_False',
 'ICMP 10.6.224.1 71.6.165.200 0 PortA 192 IP-ICMP 1 1 70 ICMP FirstHourDay_22 LastHourDay_22 duration_0 ThreatScore_122,127 ThreatCategory_21,23 True Anomaly_True',
 'ICMP 10.6.224.1 71.6.165.200 2 PortC 192 IP-ICMP 1 1 70 ICMP FirstHourDay_22 LastHourDay_22 duration_0 ThreatScore_122,127 ThreatCategory_21,23 True Anomaly_True',
 'ICMP 10.6.224.1 185.93.185.239 0 PortA 192 IP-ICMP 1 1 70 ICMP FirstHourDay_22 LastHourDay_22 duration_0 ThreatScore_127 ThreatCategory_23 True Anomaly_True']

tf_v = tf_vectorizer.fit(all_docs)

The resulting vocabulary is:

{'0a': 0,
 '185': 1,
 '239': 2,
 '45120': 3,
 '45300': 4,
 '46440': 5,
 '752': 6,
 '755': 7,
 '774': 8,
 '93': 9,
 'duration_6913': 10,
 'threatcategory_23': 11,
 'threatscore_127': 12}

Some words are missing from the vocabulary, for example ETH, FirstHourDay_22 and Anomaly_True.

Why is this? How can I get the complete vocabulary?

Edit: The error may be caused by the token_pattern used in CountVectorizer.
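
One way to check this hypothesis is to apply the default token_pattern documented for CountVectorizer, `(?u)\b\w\w+\b`, to a single lowercased document using only the `re` module. This is a rough approximation of the tokenization step, not scikit-learn's own code; `DEFAULT_TOKEN_PATTERN` and `tokenize` are names introduced here for illustration.

```python
import re

# Default token_pattern documented for CountVectorizer:
# runs of two or more word characters between word boundaries.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def tokenize(doc, pattern=DEFAULT_TOKEN_PATTERN):
    """Approximate the vectorizer's tokenization: lowercase, then regex findall."""
    return re.findall(pattern, doc.lower())

doc = ('ICMP 10.6.224.1 185.93.185.239 0 PortA 192 IP-ICMP 1 1 70 ICMP '
       'FirstHourDay_22 LastHourDay_22 duration_0 ThreatScore_127 '
       'ThreatCategory_23 True Anomaly_True')

# Default pattern: splits on '.', ',', '-', ':' and drops one-character tokens.
default_tokens = tokenize(doc)

# Whitespace-delimited alternative: every field stays in one piece.
whitespace_tokens = tokenize(doc, r"\S+")
```

With the default pattern the IP address is split into pieces such as `185` and `93`, and one-character fields such as `0` disappear, but `anomaly_true` and `firsthourday_22` are still produced as tokens. So the pattern seems to explain only part of the missing vocabulary.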

Edit: As a workaround, I suggest using the following variable, with the punctuation stripped from the tokens:

all_docs=['ETH0x0000 0017A4779C04 09002B000005 0 PortA Unknown 755 0 45300 FirstHourDay21 LastHourDay23 duration6911 ThreatScorenan ThreatCategorynan False AnomalyFalse', 'ETH0x0000 0017A4779C04 09002B000005 2 PortC Unknown 774 0 46440 FirstHourDay21 LastHourDay23 duration6911 ThreatScorenan ThreatCategorynan False AnomalyFalse', 'ETH0x0000 0017A4779C0A 09002B000005 0 PortA Unknown 752 0 45120 FirstHourDay21 LastHourDay23 duration6913 ThreatScorenan ThreatCategorynan False AnomalyFalse', 'ICMP 10.6.224.1 71.6.165.200 0 PortA 192 IP-ICMP 1 1 70 FirstHourDay22 LastHourDay22 duration0 ThreatScore122,127 ThreatCategory21,23 True AnomalyTrue', 'ICMP 10.6.224.1 71.6.165.200 2 PortC 192 IP-ICMP 1 1 70 FirstHourDay22 LastHourDay22 duration0 ThreatScore122,127 ThreatCategory21,23 True AnomalyTrue', 'ICMP 10.6.224.1 185.93.185.239 0 PortA 192 IP-ICMP 1 1 70 FirstHourDay22 LastHourDay22 duration0 ThreatScore127 ThreatCategory23 True AnomalyTrue']
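
An alternative that leaves the documents untouched might be to adjust the vectorizer itself. The sketch below is hedged: it assumes the missing terms are being filtered because an integer max_df=1 is interpreted as an absolute document count (terms found in more than one document are dropped), whereas a float 1.0 is interpreted as a proportion of documents. It also assumes that whitespace-delimited tokens are what is wanted, hence token_pattern=r'\S+'. The shortened `sample_docs` are illustrative only, not the full data above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Shortened, illustrative documents (not the full data from the question).
sample_docs = [
    'ETH:0x0000 00:17:A4:77:9C:04 0 PortA ETH Anomaly_False',
    'ICMP 10.6.224.1 71.6.165.200 0 PortA ICMP Anomaly_True',
]

# max_df=1.0 (float) keeps terms that occur in up to 100% of the documents.
# token_pattern=r'\S+' treats each whitespace-delimited field as one token.
vectorizer = CountVectorizer(max_df=1.0, min_df=1, token_pattern=r'\S+')
vectorizer.fit(sample_docs)
vocabulary = vectorizer.vocabulary_

# Same tokenization, but with the integer max_df=1 used in the question:
# any term that occurs in more than one document is dropped.
strict = CountVectorizer(max_df=1, min_df=1, token_pattern=r'\S+')
strict.fit(sample_docs)
strict_vocabulary = strict.vocabulary_
```

Here `vocabulary` contains shared terms such as `porta` and `0`, while `strict_vocabulary` keeps only terms that occur in a single document, which matches the pattern seen in the vocabulary above.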

0 answers:

No answers