My store contains 26,597 unique products.
The data I use to import products into the store looks like this:
{
"description":"AH Uien rood",
"category":"/Aardappel, groente, fruit/Kruiden, uien, knoflook/Uien/",
"brand":"AH"
}, {...}
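530 of the 26,597 products have no brand value. However, the brand name is present in the description. For the example product above, with "description":"AH Uien rood", AH is the brand name. The brand name always comes first in the description, but brand names vary in length and word count and often contain spaces. So I cannot simply take the first word of the description and assign it as the product's brand name.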
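To make the problem concrete, here is a small sketch of why taking the first word fails. Both descriptions come from the data set; `first_word` is a hypothetical helper written only for this illustration and is not part of my import code.

```ruby
# Hypothetical helper: naive brand guess that keeps only the first word.
def first_word(description)
  description.split(' ').first
end

# Descriptions paired with their real brand names.
samples = {
  'AH Uien rood'            => 'AH',
  'Royal Club Bitter lemon' => 'Royal Club'
}

samples.each do |description, brand|
  guess  = first_word(description)
  status = guess == brand ? 'correct' : 'wrong'
  puts "#{description}: guessed #{guess.inspect}, actual #{brand.inspect} (#{status})"
end
```

The single-word brand is guessed correctly, while the two-word brand is cut short.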
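I thought I would use machine learning to classify each product's brand based on its description and category.
This is my first real experience with machine learning, and I decided to use the ai4r Ruby gem. It looks good, appears well maintained, and is properly documented.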
Of the 530 products, only 13 were classified; the rest raised an error:
Ai4r::Classifiers::ModelFailureError: There was not enough information during training to do a proper induction for the data element ...
I don't quite understand this, because the DATA_SET used to train the model contains 25,266 items.
This is my code:
require 'json'
require 'open-uri'
require 'csv'
require 'ai4r'

# URI.open (open-uri, Ruby 2.5+) is used because Kernel#open no longer opens URLs in Ruby 3.
r = JSON.parse(URI.open('http://goo.gl/2IHtVU') { |f| f.read }.force_encoding('UTF-8'))

# Turns "/A, b/C/" into "A - b, C".
def extract_categories(product)
  product['category']
    .split('/')
    .reject(&:empty?)
    .map { |category| category.gsub(',', ' -') }
    .join(', ')
end

# Products without a brand: these are the ones to classify.
nb = r.select { |p| p['brand'].nil? }

DATA_LABELS = ["title", "category", "brand"]

# Training rows: [description, categories, brand] for every product that has both.
DATA_SET = []
r.each do |pnb|
  next if pnb['brand'].nil? || pnb['category'].nil?
  DATA_SET << [pnb['description'], extract_categories(pnb), pnb['brand']]
end

data_set = Ai4r::Data::DataSet.new(:data_items => DATA_SET, :data_labels => DATA_LABELS)
id3 = Ai4r::Classifiers::ID3.new.build(data_set)

classified = []
nb.each do |pnb|
  begin
    classified << id3.eval([pnb['description'], extract_categories(pnb)])
  rescue StandardError
    puts 'There was not enough information during training to do a proper induction for the data element, moving on...'
  end
end

classified.size
# => 13

# Save DATA_SET to csv
# CSV.open('/data_set.csv', 'wb', :quote_char => '"', encoding: "UTF-8") do |csv|
#   csv << DATA_LABELS
#
#   DATA_SET.each do |data|
#     csv << [data[0], data[1], data[2]]
#   end
# end
#
# => https://gist.github.com/narzero/ba8c521a370326a57a68
What would be a better way to classify product brand names based on the description?
Answer 0 (score: 3)
In this case I would go with a Naive Bayes classifier rather than a decision tree. There is a gem for it: stuff-classifier
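The reason, as far as I can tell, is that ID3 treats each whole description string as one categorical value, so a description it has never seen matches no branch of the tree. A word-based model only needs the individual words to have appeared in training. Here is a minimal sketch of that difference in plain Ruby, with no gems. The tiny training set, the unseen description "Royal Club Tonic", and the `tokens` helper are all made up for illustration.

```ruby
require 'set'

# Made-up miniature training set: [description, brand].
training = [
  ['AH Uien rood', 'AH'],
  ['AH Rauwkost Amsterdamse ui', 'AH'],
  ['Royal Club Bitter lemon', 'Royal Club']
]

# Hypothetical helper: lowercase word tokens of a description.
def tokens(text)
  text.downcase.scan(/[[:alnum:]]+/)
end

# Made-up description that does not occur in the training set.
unseen = 'Royal Club Tonic'

# Whole-string view: the unseen description equals no training description.
seen_descriptions = training.map(&:first).to_set
exact_match = seen_descriptions.include?(unseen)

# Word-level view: some of its words did occur during training.
seen_words  = training.flat_map { |description, _brand| tokens(description) }.to_set
known_words = tokens(unseen).select { |word| seen_words.include?(word) }

puts "exact match: #{exact_match}"
puts "known words: #{known_words.inspect}"
```

Since nearly every product description is unique, the whole-string view almost never finds a match, which would explain why only a handful of products got classified.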
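In the code below, I use the gem to train on your data set and classify 10 random entries. I trained on the description only, not the category. See how well it performs. If that is not good enough, you can include the category by appending it to the description, but add a prefix such as cattt to the category tokens so they can be told apart from the description tokens.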
require 'json'
require 'stuff-classifier'

# Read the product data from a local copy of the JSON file.
r = JSON.parse(File.read('data_file.json', encoding: 'UTF-8'))

# Turns "/A, b/C/" into "A - b, C".
def extract_categories(product)
  product['category']
    .split('/')
    .reject(&:empty?)
    .map { |category| category.gsub(',', ' -') }
    .join(', ')
end

# Products without a brand (not used below, kept for classifying them later).
nb = r.select { |p| p['brand'].nil? }

DATA_LABELS = ["title", "category", "brand"]

# Training rows: [description, categories, brand] for every product that has both.
DATA_SET = []
r.each do |pnb|
  next if pnb['brand'].nil? || pnb['category'].nil?
  DATA_SET << [pnb['description'], extract_categories(pnb), pnb['brand']]
end

cls = StuffClassifier::Bayes.new("Product Label")

# Train the classifier by feeding it the label (brand) and then the features (description).
DATA_SET.each do |record|
  begin
    cls.train(record[2], record[0])
  rescue StandardError
    # Ignore records that raise during training.
  end
end

# Print 10 random classifications. Note that the entries are sampled from the training data.
10.times do
  random_entry = DATA_SET.sample[0]
  puts "#{random_entry} - Classified as - #{cls.classify(random_entry)}"
end
Result:
Organix Goodies squeezy banaan, aardbei & zuivel - Classified as - Organix
AH Dames hipster elastisch zwart maat M => John Cabot/AH
Piramide Sterrenmix fairtrade => Piramide
Royal Club Bitter lemon => Royal Club
AH Fruitbiscuit yoghurt/aardbei => AH
Toni&Guy Mask reconstruction treatment => Toni&Guy
AH Kinder enkelsok wit mt 23-26 => AH
Theramed Aardbei junior 6+ jaar => Theramed
Arla Bio drinkyoghurt limoen/munt => Arla
AH Rauwkost Amsterdamse ui => AH
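If you want to include the category as suggested above, one possible sketch is to prefix every category word with a marker such as cattt and append the result to the description. `features_for` is a hypothetical helper, not part of stuff-classifier. The string it returns is what you would pass to `cls.train` and `cls.classify` in place of the bare description. How the gem's tokenizer treats the prefixed tokens is something I have not checked, so verify that before relying on it.

```ruby
# Hypothetical helper: build one text feature from description and category.
# Each category word gets a prefix so it cannot be confused with the same
# word appearing in a description.
def features_for(product, prefix = 'cattt')
  words  = product['category'].to_s.downcase.scan(/[[:alnum:]]+/).uniq
  tagged = words.map { |word| "#{prefix}#{word}" }
  ([product['description']] + tagged).join(' ')
end

product = {
  'description' => 'AH Uien rood',
  'category'    => '/Aardappel, groente, fruit/Kruiden, uien, knoflook/Uien/',
  'brand'       => 'AH'
}

puts features_for(product)
```

A product without a category simply yields its description unchanged, so the same helper can be used for every record.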