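Token pattern for numbers in TfidfVectorizer (sklearn, Python)

Asked: 2018-05-24 07:03:13

Tags: python scikit-learn tokenize tfidfvectorizer

I need to compute the tf-idf matrix for a few sentences. The sentences contain both numbers and words. I am using the code below to do this: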

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

data1 = ['1/8 wire', '4 tube', '1-1/4 brush']
dataset = pd.DataFrame(data1, columns=['des'])
vectorizer1 = TfidfVectorizer(lowercase=False)
# Note: newer scikit-learn releases replace get_feature_names() with get_feature_names_out().
tf_idf_matrix = pd.DataFrame(
    vectorizer1.fit_transform(dataset['des']).toarray(),
    columns=vectorizer1.get_feature_names(),
)

TfidfVectorizer only picks up the words as its vocabulary, i.e.

Out[3]: ['brush', 'tube', 'wire']

But I need the numbers to be part of the tokens as well.

Expected:

Out[3]: ['brush', 'tube', 'wire','1/8','4','1-1/4']

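After reading the TfidfVectorizer documentation, I found that I have to change the token_pattern or tokenizer parameter, but I don't know how to change it so that numbers and punctuation are taken into account.

Can anyone tell me how to change these parameters?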

1 Answer:

Answer 0 (score: 1)

You can explicitly specify in the token_pattern parameter which symbols you want to parse:


    tfidf = TfidfVectorizer(token_pattern = token_pattern_)

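Here token_pattern_ is a regular expression in which {1,} sets the minimum number of symbols a token must contain. The pattern is passed to the vectorizer through the token_pattern argument, as in the call above.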
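
For reference, a minimal sketch of the whole approach on the data from the question. The exact regular expression is not given above, so the `token_pattern_` value below is an assumption: a character class of letters, digits, `/` and `-`, repeated `{1,}` times. By default TfidfVectorizer uses `(?u)\b\w\w+\b`, which only keeps runs of two or more word characters; that is why `4` is dropped and `1/8` never shows up as a token.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['1/8 wire', '4 tube', '1-1/4 brush']

# Assumed pattern (not taken from the answer): a token is one or more
# letters, digits, '/' or '-'. The trailing '-' in the class is a literal hyphen.
token_pattern_ = r'[A-Za-z0-9/-]{1,}'

tfidf = TfidfVectorizer(lowercase=False, token_pattern=token_pattern_)
matrix = tfidf.fit_transform(docs)

# vocabulary_ maps each token to its column index; sorting the keys
# gives the learned tokens in column order.
vocabulary = sorted(tfidf.vocabulary_)
print(vocabulary)
# ['1-1/4', '1/8', '4', 'brush', 'tube', 'wire']
```

Reading the tokens from `vocabulary_` avoids depending on `get_feature_names()` or `get_feature_names_out()`, whose availability differs between scikit-learn versions. Raising the lower bound, for example `{2,}`, would again drop single-character tokens such as `4`.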
