How to group items in a tab-delimited file by a specific column

Asked: 2013-05-28 11:55:16

Tags: ruby

I have the following records in a tab-delimited text file:

sku title   Product Type                        
19686940    This is test Title1 toys                        
19686941    This is test Title2 toys                        
19686942    This is test Title3 toys                        
20519300    This is test Title1 toys2                       
20519301    This is test Title2 toys2
20580987    This is test Title1 toys3                       
20580988    This is test Title2 toys3                       
20582176    This is test Title1 toys4   

How can I group the items by Product Type and find all the unique words in the titles of each group?

Output format:

Product Type   Unique_words 
------------   ------------ 
toys           This is test Title1 Title2 Title3
toys2          This is test Title1 Title2
toys3          This is test Title1 Title2
toys4          This is test Title1

Update:
So far I have written code that reads the file and stores the records in an array:

class Product
  # attr_reader already defines the sku, title and productType getters
  attr_reader :sku, :title, :productType

  def initialize(sku, title, productType)
    @sku = sku
    @title = title
    @productType = productType
  end
end

class FileReader
  def ReadFile(m_FilePath)
    array = Array.new
    lines = IO.readlines(m_FilePath)

    lines.each do |line|
      # chomp drops the trailing newline before splitting on tabs
      current_row = line.chomp.split("\t")
      product = Product.new(current_row[0], current_row[1], current_row[2])

      array.push product
    end

    array # return the products; each would otherwise return lines
  end
end

filereader_method = FileReader.new.method("ReadFile")
Reading =  filereader_method.to_proc

puts Reading.call("Input.txt")  

1 answer:

Answer 0 (score: 0)

To do the grouping, you can use Enumerable#group_by:

Product = Struct.new(:sku, :title, :product_type)

def products_by_type(file_path)
  File.foreach(file_path)   # yields each line and closes the file when done
      .drop(1)              # skip the header row
      .map { |line| Product.new(*line.chomp.split("\t").first(3)) }
      .group_by { |product| product.product_type }
end
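
To see the shape of what group_by returns without touching a file, here is a minimal sketch on in-memory data. The Item struct and the sample rows are assumptions made up for illustration (they mirror the question's records); they are not part of the original answer.

```ruby
# Hypothetical struct and sample rows mirroring the question's data.
Item = Struct.new(:sku, :title, :product_type)

items = [
  Item.new("19686940", "This is test Title1", "toys"),
  Item.new("19686941", "This is test Title2", "toys"),
  Item.new("20519300", "This is test Title1", "toys2")
]

# group_by returns a Hash: block result => array of the elements that produced it.
grouped = items.group_by(&:product_type)

grouped.each do |type, members|
  puts "#{type}: #{members.map(&:sku).join(', ')}"
end
```

Each key of the hash is a product type and each value is the array of records for that type, which is what the later formatting step iterates over.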

The beauty of Ruby is that you have plenty of options. You could also look at the CSV library and OpenStruct, since this is just a data object:

require 'csv'
require 'ostruct'

def products_by_type(file_path)
  csv_opts = { col_sep: "\t",
               headers: true,
               header_converters: [:downcase, :symbol] }

  # ** passes the hash as keyword options (required on Ruby 3+)
  CSV.foreach(file_path, **csv_opts)
     .map { |row| OpenStruct.new row.to_hash }
     .group_by { |product| product.product_type }
end
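
The header converters are what make product.product_type work: the "Product Type" header is turned into the symbol :product_type. A small sketch that shows this on an in-memory string, assuming the file looks like the sample in the question (the sample variable is made up for illustration):

```ruby
require 'csv'
require 'ostruct'

# Hypothetical in-memory sample standing in for the tab-delimited file.
sample = "sku\ttitle\tProduct Type\n" \
         "19686940\tThis is test Title1\ttoys\n" \
         "20519300\tThis is test Title1\ttoys2\n"

table = CSV.parse(sample, col_sep: "\t", headers: true,
                  header_converters: [:downcase, :symbol])

# Headers are now symbols; "Product Type" became :product_type.
p table.headers

# Each row converts to a Hash, which OpenStruct exposes as reader methods.
records = table.map { |row| OpenStruct.new(row.to_hash) }
puts records.first.product_type
```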

Or use the idiom of initializing an object from a hash's keys, which removes the need to call #to_hash on row above:

class Product
  attr_accessor :sku, :title, :product_type

  def initialize(data)
    # call the matching setter (sku=, title=, product_type=) for each pair
    data.each { |key, value| public_send("#{key}=", value) }
  end
end

def products_by_type(file_path)
  csv_opts = {
    # ... (same options as in the previous example)
  }

  CSV.foreach(file_path, **csv_opts)
     .map { |row| Product.new row }
     .group_by { |product| product.product_type }
end
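
The initialization idiom can be tried on its own by passing a plain Hash instead of a CSV row. This is only a sketch: the Record class name and the sample values are assumptions for illustration, and it relies on every key having a matching attr_accessor.

```ruby
# Hypothetical class illustrating initialization from key/value pairs.
class Record
  attr_accessor :sku, :title, :product_type

  def initialize(data)
    # For each pair, call the matching writer, e.g. title=("...").
    # A key without an accessor would raise NoMethodError.
    data.each { |key, value| public_send("#{key}=", value) }
  end
end

record = Record.new(sku: "19686940", title: "This is test Title1", product_type: "toys")
puts record.product_type
```

Anything that yields key/value pairs from each works as the argument, which is why a CSV row with symbol headers can be passed in directly.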

Then, working from the resulting hash, format the output however you need:

def unique_title_words(products)
  products.flat_map { |product| product.title.scan(/\w+/) }
          .uniq
end

puts "Product Type\tUnique Words"
products_by_type("./file.txt").each do |type, products|
  puts "#{type}\t#{unique_title_words(products).join(' ')}"
end
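
If the aligned columns from the question's output format matter, padding with String#ljust is one possible approach. The following is a self-contained sketch on in-memory rows (the rows array is made-up sample data, and Hash#transform_values assumes Ruby 2.4 or later):

```ruby
# Hypothetical sample rows: [sku, title, product type].
rows = [
  ["19686940", "This is test Title1", "toys"],
  ["19686941", "This is test Title2", "toys"],
  ["19686942", "This is test Title3", "toys"],
  ["20519300", "This is test Title1", "toys2"],
  ["20582176", "This is test Title1", "toys4"]
]

# Group rows by product type, then reduce each group to its unique title words.
grouped = rows.group_by { |_sku, _title, type| type }
words_by_type = grouped.transform_values do |group|
  group.flat_map { |_sku, title, _type| title.scan(/\w+/) }.uniq
end

# Pad the first column to the widest entry so the second column lines up.
width = [words_by_type.keys.map(&:length).max, "Product Type".length].max
lines = ["#{'Product Type'.ljust(width)}   Unique_words",
         "#{'-' * width}   #{'-' * 12}"]
words_by_type.each do |type, words|
  lines << "#{type.ljust(width)}   #{words.join(' ')}"
end

puts lines
```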