Consider the following example:
from sklearn.feature_extraction.text import CountVectorizer

tf_vectorizer = CountVectorizer(max_df=1, min_df=0,
                                max_features=None,
                                stop_words=None)
all_docs = ['ETH:0x0000 00:17:A4:77:9C:04 09:00:2B:00:00:05 0 PortA Unknown 755 0 45300 ETH FirstHourDay_21 LastHourDay_23 duration_6911 ThreatScore_nan ThreatCategory_nan False Anomaly_False',
'ETH:0x0000 00:17:A4:77:9C:04 09:00:2B:00:00:05 2 PortC Unknown 774 0 46440 ETH FirstHourDay_21 LastHourDay_23 duration_6911 ThreatScore_nan ThreatCategory_nan False Anomaly_False',
'ETH:0x0000 00:17:A4:77:9C:0A 09:00:2B:00:00:05 0 PortA Unknown 752 0 45120 ETH FirstHourDay_21 LastHourDay_23 duration_6913 ThreatScore_nan ThreatCategory_nan False Anomaly_False',
'ICMP 10.6.224.1 71.6.165.200 0 PortA 192 IP-ICMP 1 1 70 ICMP FirstHourDay_22 LastHourDay_22 duration_0 ThreatScore_122,127 ThreatCategory_21,23 True Anomaly_True',
'ICMP 10.6.224.1 71.6.165.200 2 PortC 192 IP-ICMP 1 1 70 ICMP FirstHourDay_22 LastHourDay_22 duration_0 ThreatScore_122,127 ThreatCategory_21,23 True Anomaly_True',
'ICMP 10.6.224.1 185.93.185.239 0 PortA 192 IP-ICMP 1 1 70 ICMP FirstHourDay_22 LastHourDay_22 duration_0 ThreatScore_127 ThreatCategory_23 True Anomaly_True']
tf_v = tf_vectorizer.fit(all_docs)
The resulting vocabulary is:
{'0a': 0,
'185': 1,
'239': 2,
'45120': 3,
'45300': 4,
'46440': 5,
'752': 6,
'755': 7,
'774': 8,
'93': 9,
'duration_6913': 10,
'threatcategory_23': 11,
'threatscore_127': 12}
Some words are missing from the vocabulary, for example ETH, FirstHourDay_22 and Anomaly_True.
Why is that? How can I get the complete vocabulary?
Edit: The problem may be caused by the token_pattern value in CountVectorizer.
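One way to check this hypothesis is to emulate the default analyzer with the standard library only. The sketch below assumes the documented defaults (lowercasing, and a token_pattern of `(?u)\b\w\w+\b`, i.e. runs of two or more word characters) and treats an integer max_df as an absolute document count. The helper names `tokenize` and `document_frequency` are made up for illustration; this is an approximation of what CountVectorizer does, not its actual implementation.

```python
import re
from collections import Counter

# Assumed default token_pattern of CountVectorizer: two or more word characters.
DEFAULT_TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(doc):
    """Lowercase the text, then extract tokens with the default pattern."""
    return DEFAULT_TOKEN_PATTERN.findall(doc.lower())


def document_frequency(docs):
    """Count in how many documents each token occurs (not how often)."""
    counts = Counter()
    for doc in docs:
        counts.update(set(tokenize(doc)))
    return counts


all_docs = [
    'ETH:0x0000 00:17:A4:77:9C:04 09:00:2B:00:00:05 0 PortA Unknown 755 0 45300 ETH FirstHourDay_21 LastHourDay_23 duration_6911 ThreatScore_nan ThreatCategory_nan False Anomaly_False',
    'ETH:0x0000 00:17:A4:77:9C:04 09:00:2B:00:00:05 2 PortC Unknown 774 0 46440 ETH FirstHourDay_21 LastHourDay_23 duration_6911 ThreatScore_nan ThreatCategory_nan False Anomaly_False',
    'ETH:0x0000 00:17:A4:77:9C:0A 09:00:2B:00:00:05 0 PortA Unknown 752 0 45120 ETH FirstHourDay_21 LastHourDay_23 duration_6913 ThreatScore_nan ThreatCategory_nan False Anomaly_False',
    'ICMP 10.6.224.1 71.6.165.200 0 PortA 192 IP-ICMP 1 1 70 ICMP FirstHourDay_22 LastHourDay_22 duration_0 ThreatScore_122,127 ThreatCategory_21,23 True Anomaly_True',
    'ICMP 10.6.224.1 71.6.165.200 2 PortC 192 IP-ICMP 1 1 70 ICMP FirstHourDay_22 LastHourDay_22 duration_0 ThreatScore_122,127 ThreatCategory_21,23 True Anomaly_True',
    'ICMP 10.6.224.1 185.93.185.239 0 PortA 192 IP-ICMP 1 1 70 ICMP FirstHourDay_22 LastHourDay_22 duration_0 ThreatScore_127 ThreatCategory_23 True Anomaly_True',
]

df = document_frequency(all_docs)

# max_df=1 as an integer: keep only tokens found in at most one document.
kept = sorted(token for token, n_docs in df.items() if n_docs <= 1)

print(kept)
print("eth document frequency:", df["eth"])
```

If the emulation is faithful, `kept` should match the vocabulary shown above, while `eth` is tokenized but occurs in three documents. That would suggest token_pattern explains the lowercasing, the splitting on `:`, `.`, `,` and `-`, and the loss of single-character fields such as `0`, whereas the integer max_df=1 is what removes every token shared by two or more documents.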
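Edit: To work around the problem, I suggest using the following all_docs, with the ':' and '_' separators stripped out: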
all_docs=['ETH0x0000 0017A4779C04 09002B000005 0 PortA Unknown 755 0 45300 FirstHourDay21 LastHourDay23 duration6911 ThreatScorenan ThreatCategorynan False AnomalyFalse',
'ETH0x0000 0017A4779C04 09002B000005 2 PortC Unknown 774 0 46440 FirstHourDay21 LastHourDay23 duration6911 ThreatScorenan ThreatCategorynan False AnomalyFalse',
'ETH0x0000 0017A4779C0A 09002B000005 0 PortA Unknown 752 0 45120 FirstHourDay21 LastHourDay23 duration6913 ThreatScorenan ThreatCategorynan False AnomalyFalse',
'ICMP 10.6.224.1 71.6.165.200 0 PortA 192 IP-ICMP 1 1 70 FirstHourDay22 LastHourDay22 duration0 ThreatScore122,127 ThreatCategory21,23 True AnomalyTrue',
'ICMP 10.6.224.1 71.6.165.200 2 PortC 192 IP-ICMP 1 1 70 FirstHourDay22 LastHourDay22 duration0 ThreatScore122,127 ThreatCategory21,23 True AnomalyTrue',
'ICMP 10.6.224.1 185.93.185.239 0 PortA 192 IP-ICMP 1 1 70 FirstHourDay22 LastHourDay22 duration0 ThreatScore127 ThreatCategory23 True AnomalyTrue']
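An alternative that leaves the raw strings untouched is to change the vectorizer settings instead of the data. The sketch below assumes the goal is one vocabulary entry per whitespace-separated field. The choices `token_pattern=r"\S+"` and `lowercase=False` are my assumptions, not something from the original setup. `max_df=1.0` is passed as a float, which scikit-learn documents as a proportion of documents, whereas the integer `1` is an absolute document count. Only two of the original records are used to keep the example short.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two of the original, unmodified records.
docs = [
    'ETH:0x0000 00:17:A4:77:9C:04 09:00:2B:00:00:05 0 PortA Unknown 755 0 45300 ETH FirstHourDay_21 LastHourDay_23 duration_6911 ThreatScore_nan ThreatCategory_nan False Anomaly_False',
    'ICMP 10.6.224.1 71.6.165.200 0 PortA 192 IP-ICMP 1 1 70 ICMP FirstHourDay_22 LastHourDay_22 duration_0 ThreatScore_122,127 ThreatCategory_21,23 True Anomaly_True',
]

vectorizer = CountVectorizer(
    max_df=1.0,            # float: a proportion, so no term is dropped as too frequent
    min_df=1,              # int: keep terms that occur in at least one document
    token_pattern=r"\S+",  # assumption: every whitespace-separated field is one token
    lowercase=False,       # assumption: keep ETH, PortA, ... exactly as written
)
vectorizer.fit(docs)

vocabulary = vectorizer.vocabulary_
for term in ("ETH", "FirstHourDay_22", "Anomaly_True", "0"):
    print(term, term in vocabulary)
```

With this pattern, fields such as `ThreatScore_122,127` stay a single token; a different token_pattern would be needed if the comma-separated values should become separate entries.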