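I need to classify executable files as malicious or non-malicious. I built my own corpus. The error I get is shown below, along with the input file format. How can I display the selected features by name for each file and save them to a text file as a dataset? How can I test several files at once? I am new to n-gram classification, so any help with these issues is appreciated. Thanks in advance.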
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest
from sklearn import svm
a = load_files(r'D:\Train')  # two subfolders, true (non-malicious) and false (malicious), which become the target labels
vectorizer = CountVectorizer(ngram_range=(4,4))
X = vectorizer.fit_transform(a.data)
B,c = X, a.target
b_new = SelectKBest(chi2, k=1000).fit_transform(B, c)
clf = svm.SVC(gamma="auto", C=1.)
clf.fit(b_new,a.target)
y = vectorizer.transform(open('D:/data/PRE/chrome.txt'))
le = preprocessing.LabelEncoder()
data = le.fit_transform(matrix)
data = data.reshape(1,-1)
print(clf.predict(data))
ERROR:
File "D:/spyder/corpus.py", line 59, in <module>
print(clf.predict(data))
ValueError: X.shape[1] = 482180 should be equal to 1000, the number of features at training time
Input file format (hex file)
90 00 03 00 00 00 04 00 00 00 FF FF 00 00
00 00 00 00 00 00 40 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 F0 00 00 00
BA 0E 00 B4 09 CD 21 B8 01 4C CD 21 54 68
20 70 72 6F 67 72 61 6D 20 63 61 6E 6E 6F
62 65 20 72 75 6E 20 69 6E 20 44 4F 53 20
64 65 2E 0D 0D 0A 24 00 00 00 00 00 00 00
94 01 36 82 FA 52 36 82 FA 52 36 82 FA 52
Updated code
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest
from sklearn import svm
a = load_files(r'D:\Train')  # two subfolders, true (non-malicious) and false (malicious), which become the target labels
vectorizer = CountVectorizer(ngram_range=(4,4))
X = vectorizer.fit_transform(a.data)
B,c = X, a.target
ch2 = SelectKBest(chi2, k=1000)
X_train = ch2.fit_transform(B,c)
clf = svm.SVC()
clf.fit(X_train,a.target)
y = vectorizer.transform(open('D:/data/PRE/chrome.txt'))
X_test = ch2.transform(y)
print(clf.predict(X_test))
Output
[1 1 1 ..., 1 1 1]
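Problem: I supplied only one file as the test set, so all of its content is stored in a single array. Why, then, does the output contain several 1s? There should be exactly one. The other problem is that the output is always an array of 1s, whatever test data I use. This is a binary classification task, yet the other class is never returned. What should I do?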
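A likely cause of the many 1s, sketched here on toy data: `CountVectorizer.transform` iterates over whatever it receives, and iterating over an open file yields one string per line, so each line of the hex dump becomes its own sample. Reading the whole file into one string and wrapping it in a list gives a single sample. The hex content and the temporary file name are made up for illustration.

```python
import os
import tempfile
from sklearn.feature_extraction.text import CountVectorizer

# Toy hex dump with three lines (hypothetical data, not from the original corpus).
hex_dump = (
    "90 00 03 00 00 00 04 00\n"
    "BA 0E 00 B4 09 CD 21 B8\n"
    "94 01 36 82 FA 52 36 82\n"
)

vectorizer = CountVectorizer(ngram_range=(4, 4))
vectorizer.fit([hex_dump])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w") as fh:
        fh.write(hex_dump)

    # Passing the file object: every line is treated as a separate document.
    with open(path) as fh:
        per_line = vectorizer.transform(fh)

    # Passing a one-element list with the full text: a single document.
    with open(path) as fh:
        whole_file = vectorizer.transform([fh.read()])

print(per_line.shape[0])    # 3, one row per line of the file
print(whole_file.shape[0])  # 1, one row for the whole file
```

With one row per file, `clf.predict` returns one label per file rather than one per line.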
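Answer 0 (score: 0)
I solved one of the problems: the output now contains a single 1 for a single file. I only had to change the following lines of code.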
docs=['D:/data/PRE/office.txt']
y = vectorizer.transform(docs)