I need to compute the TF-IDF matrix for a few sentences. The sentences contain both numbers and words. I am using the code below to do this:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
data1 = ['1/8 wire', '4 tube', '1-1/4 brush']
dataset = pd.DataFrame(data1, columns=['des'])
vectorizer1 = TfidfVectorizer(lowercase=False)
# get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
tf_idf_matrix = pd.DataFrame(
    vectorizer1.fit_transform(dataset['des']).toarray(),
    columns=vectorizer1.get_feature_names_out(),
)
TfidfVectorizer only includes the words in its vocabulary, i.e.
Out[3]: ['brush', 'tube', 'wire']
But I need the numbers to be tokens as well.
Expected:
Out[3]: ['brush', 'tube', 'wire','1/8','4','1-1/4']
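After reading the TfidfVectorizer documentation, I found that I have to change the token_pattern and tokenizer parameters, but I don't know how to change them so that numbers and punctuation are taken into account.
Can anyone tell me how to change these parameters?
Answer 0 (score: 1)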
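You can state explicitly in the token_pattern parameter which characters should be parsed. Define a regular expression (token_pattern_ below) that lists those characters; the {1,} in it sets the minimum number of characters a token must contain. Then pass it as the token_pattern argument:
tfidf = TfidfVectorizer(token_pattern = token_pattern_)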
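As a rough sketch of what such a pattern could look like for the data in the question: the regular expression below is an assumption (it is not the pattern from the answer), and it accepts letters, digits, "/" and "-" in runs of one or more characters. The helper names preview_tokens and fit_vocabulary are hypothetical. The preview uses re.findall, which is how a token_pattern without capturing groups is expected to be applied to each document.

```python
import re

# Assumed pattern: letters, digits, "/" and "-", at least one character long.
TOKEN_PATTERN = r"[A-Za-z0-9/-]{1,}"


def preview_tokens(text, pattern=TOKEN_PATTERN):
    """Return the tokens the pattern extracts from one document."""
    return re.findall(pattern, text)


def fit_vocabulary(docs, pattern=TOKEN_PATTERN):
    """Fit a TfidfVectorizer with the custom pattern and return its sorted vocabulary."""
    # Imported here so the regex preview above works without scikit-learn.
    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(lowercase=False, token_pattern=pattern)
    vectorizer.fit(docs)
    return sorted(vectorizer.vocabulary_)


docs = ["1/8 wire", "4 tube", "1-1/4 brush"]
for doc in docs:
    print(doc, "->", preview_tokens(doc))
```

Calling fit_vocabulary(docs) should then give a vocabulary that contains '1/8', '4' and '1-1/4' alongside the words. If other punctuation must be kept, add it to the character class.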