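I have a plain-text table like this one. I need to group each result row so that the data ends up in the correct columns.
I can split the string (one line) on spaces, which gives me an array like this: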
["2", "1/47", "M4044", "25:03*", "856", "12:22", "12:41", "17.52", "Some", "Name", "Yo", "Prairie", "Inn", "Harriers", "Runni", "25:03"]
I can also split on two spaces, which gets me close, but the result is still inconsistent, as you can see:
["2", " 1/47", "M4044", " 25:03*", "856", " 12:22", " 12:41", "17.52 Some Name Yo", "", "", "", "", "", "", "Prairie Inn Harriers Runni", " 25:03 "]
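I could specify which indexes to join, but I need to scrape possibly thousands of files like this, and the columns are not always in the same order.
The one constant is that the column data is never longer than the separator (====) between the column names and the data. I tried to take advantage of that, but ran into some loopholes.
I need to write an algorithm that detects what belongs in the Name column and what belongs in the other "word" columns. Any ideas?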
Answer 0 (score: 4)
First we set up the problem:
data = <<EOF
Place Div/Tot Div Guntime PerF 1sthalf 2ndhalf 100m Name Club Nettime
===== ======= ===== ======= ==== ======= ======= ====== ========================= ========================== =======
1 1/24 M3034 24:46 866 12:11 12:35 15.88 Andy Bas Prairie Inn Harriers 24:46
2 1/47 M4044 25:03* 856 12:22 12:41 17.52 Some Name Yo Prairie Inn Harriers Runni 25:03
EOF
lines = data.split "\n"
I like to build a format string for String#unpack from the separator line:
format = lines[1].scan(/(=+)(\s+)/).map{|f, s| "A#{f.size}" + 'x' * s.size}.join
#=> A5xA7xA5xA7xxA4xA7xA7xA6xA25xA26xA7x
The rest is easy:
headers = lines[0].unpack(format)
lines[2..-1].each do |line|
  puts Hash[headers.zip(line.unpack(format).map(&:strip))]
end
#=> {"Place"=>"1", "Div/Tot"=>"1/24", "Div"=>"M3034", "Guntime"=>"24:46", "PerF"=>"866", "1sthalf"=>"12:11", "2ndhalf"=>"12:35", "100m"=>"15.88", "Name"=>"Andy Bas", "Club"=>"Prairie Inn Harriers", "Nettime"=>"24:46"}
#=> {"Place"=>"2", "Div/Tot"=>"1/47", "Div"=>"M4044", "Guntime"=>"25:03*", "PerF"=>"856", "1sthalf"=>"12:22", "2ndhalf"=>"12:41", "100m"=>"17.52", "Name"=>"Some Name Yo", "Club"=>"Prairie Inn Harriers Runni", "Nettime"=>"25:03"}
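As a self-contained illustration of the unpack approach above, here is a minimal sketch. The three-column table, the helper names unpack_format and parse_table, and the ljust padding of short lines are assumptions made for this example; they are not part of the original answer.

```ruby
# Build an unpack format from the separator line: "A<n>" reads n characters
# and drops trailing spaces, "x" skips one character of the gap.
def unpack_format(separator)
  separator.scan(/(=+)(\s*)/).map { |bar, gap| "A#{bar.size}" + "x" * gap.size }.join
end

# Returns one hash per data row, keyed by column header.
def parse_table(text)
  header, separator, *rows = text.lines.map(&:chomp)
  format = unpack_format(separator)
  # Pad short lines so that every "x" skip stays inside the string.
  cut = ->(line) { line.ljust(separator.size).unpack(format).map(&:strip) }
  names = cut.call(header)
  rows.map { |row| names.zip(cut.call(row)).to_h }
end

# Hypothetical miniature table, aligned under the "=" runs.
table = <<~EOF
  Place Name         Club
  ===== ============ ====================
      1 Andy Bas     Prairie Inn Harriers
      2 Some Name Yo Prairie Inn
EOF

parse_table(table).each { |record| puts record }
```

Because each record is keyed by the header text, the order of the columns in a given file should not matter, as long as the header and data are aligned under the separator.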
Answer 1 (score: 2)
This should work:
divider = "===== ======= ===== ======= ==== ======= ======= ====== ========================= ========================== ======="
str = " 1 1/24 M3034 24:46 866 12:11 12:35 15.88 Andy Bas Prairie Inn Harriers 24:46"
divider.split(/\s+/).each {|delimiter| puts str.slice!(0..delimiter.size).strip }
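The one-liner above consumes str in place and assumes roughly one space between columns. A possible variant, sketched here under the assumption that gaps may be wider than one character, works on a copy of the row and removes each column together with the gap that follows it. The method name slice_row is hypothetical.

```ruby
# Cut one row into fields using the widths of the "=" runs in the separator.
# Works on a copy, so the caller's string is left untouched.
def slice_row(separator, row)
  rest = row.dup
  separator.scan(/(=+)(\s*)/).map do |bar, gap|
    # Remove the column plus the gap after it, then trim the padding.
    rest.slice!(0, bar.size + gap.size).to_s.strip
  end
end

# Hypothetical row with a two-space gap before the last column.
separator = "===== =======  ===="
row       = "    2    1/47   856"
p slice_row(separator, row)
```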
Answer 2 (score: 0)
Here is a working solution (based on the file you gave, but it should generalize to all files of this form):
#!/usr/bin/env ruby
FILE = 'gistfile1.txt'
f = File.new(FILE, 'r')
l = f.gets # read the first line (which contains the headers)
# parse the columns: header text, where each one starts and stops, and its length
headers = l.scan(/(\S+\W+)/).map{|s| [s.join, l.index(s.join), s.join.length]}.map{|a| {:head=>a[0].strip, :start=>a[1], :end=>a[1]+a[2], :len=>a[2]}}
f.gets # skip past the header-data separator line
records = []
while (l = f.gets)
  record = {}
  headers.each{|h|
    # to_s guards against lines that end before this column starts
    record[h[:head]] = l[h[:start]...h[:end]].to_s.strip
    print "#{h[:head]} => #{record[h[:head]]}\n"
  }
  print "*" * l.length, "\n"
  records << record
end
f.close
# records now holds one hash per row, mapping each column header to its value
For demonstration, I echo the records as I go:
Place => 1
Div/Tot => 1/24
Div => M3034
Guntime => 24:46
PerF => 866
1sthalf => 12:11
2ndhalf => 12:35
100m => 15.88
Name => Andy Bas
Club => Prairie Inn Harriers
Nettime => 24:46
***********************************************************************************************************************
Place => 2
Div/Tot => 1/47
Div => M4044
Guntime => 25:03*
PerF => 856
1sthalf => 12:22
2ndhalf => 12:41
100m => 17.52
Name => Some Name Yo
Club => Prairie Inn Harriers Runni
Nettime => 25:03
**********************************************************************************************************************
Answer 3 (score: 0)
header, format, *data = plain_text_table.split($/)
h = {}
format.scan(/=+/) do
range = $~.begin(0)..$~.end(0)
h[header[range].strip] = data.map{|s| s[range].strip}
end
h # => {
"Place" => ["1", "2"],
"Div/Tot" => ["1/24", "1/47"],
"Div" => ["M3034", "M4044"],
"Guntime" => ["24:46", "25:03*"],
"PerF" => ["866", "856"],
"1sthalf" => ["12:11", "12:22"],
"2ndhalf" => ["12:35", "12:41"],
"100m" => ["15.88", "17.52"],
"Name" => ["Andy Bas", "Some Name Yo"],
"Club" => ["Prairie Inn Harriers", "Prairie Inn Harriers Runni"],
"Nettime" => ["24:46", "25:03"]
}
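The hash above is organized by column. If one hash per result row is preferred, the columns can be flipped with Array#transpose. This is a minimal sketch; the method name columns_to_rows is hypothetical, and it assumes every column holds the same number of values.

```ruby
# Turn a column-oriented hash (header => array of values)
# into an array with one hash per row.
def columns_to_rows(columns)
  names = columns.keys
  # transpose flips [values of col 1, values of col 2, ...] into rows
  columns.values.transpose.map { |row| names.zip(row).to_h }
end

# Sample values taken from the output above.
h = {
  "Place" => ["1", "2"],
  "Name"  => ["Andy Bas", "Some Name Yo"]
}
columns_to_rows(h).each { |record| p record }
```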