I have this raw text:
________________________________________________________________________________________________________________________________
Pos Car Competitor/Team Driver Vehicle Cap CL Laps Race.Time Fastest...Lap
1 6 Jason Clements Jason Clements BMW M3 3200 10 9:48.5710 3 0:57.3228*
2 42 David Skillender David Skillender Holden VS Commodore 6000 10 9:55.6866 2 0:57.9409
3 37 Bruce Cook Bruce Cook Ford Escort 3759 10 9:56.4388 4 0:58.3359
4 18 Troy Marinelli Troy Marinelli Nissan Silvia 3396 10 9:56.7758 2 0:58.4443
5 75 Anthony Gilbertson Anthony Gilbertson BMW M3 3200 10 10:02.5842 3 0:58.9336
6 26 Trent Purcell Trent Purcell Mazda RX7 2354 10 10:07.6285 4 0:59.0546
7 12 Scott Hunter Scott Hunter Toyota Corolla 2000 10 10:11.3722 5 0:59.8921
8 91 Graeme Wilkinson Graeme Wilkinson Ford Escort 2000 10 10:13.4114 5 1:00.2175
9 7 Justin Wade Justin Wade BMW M3 4000 10 10:18.2020 9 1:00.8969
10 55 Greg Craig Grag Craig Toyota Corolla 1840 10 10:18.9956 7 1:00.7905
11 46 Kyle Orgam-Moore Kyle Organ-Moore Holden VS Commodore 6000 10 10:30.0179 3 1:01.6741
12 39 Uptiles Strathpine Trent Spencer BMW Mini Cooper S 1500 10 10:40.1436 2 1:02.2728
13 177 Mark Hyde Mark Hyde Ford Escort 1993 10 10:49.5920 2 1:03.8069
14 34 Peter Draheim Peter Draheim Mazda RX3 2600 10 10:50.8159 10 1:03.4396
15 5 Scott Douglas Scott Douglas Datsun 1200 1998 9 9:48.7808 3 1:01.5371
16 72 Paul Redman Paul Redman Ford Focus 2lt 9 10:11.3707 2 1:05.8729
17 8 Matthew Speakman Matthew Speakman Toyota Celica 1600 9 10:16.3159 3 1:05.9117
18 74 Lucas Easton Lucas Easton Toyota Celica 1600 9 10:16.8050 6 1:06.0748
19 77 Dean Fuller Dean Fuller Mitsubishi Sigma 2600 9 10:25.2877 3 1:07.3991
20 16 Brett Batterby Brett Batterby Toyota Corolla 1600 9 10:29.9127 4 1:07.8420
21 95 Ross Hurford Ross Hurford Toyota Corolla 1600 8 9:57.5297 2 1:12.2672
DNF 13 Charles Wright Charles Wright BMW 325i 2700 9 9:47.9888 7 1:03.2808
DNF 20 Shane Satchwell Shane Satchwell Datsun 1200 Coupe 1998 1 1:05.9100 1 1:05.9100
Fastest Lap Av.Speed Is 152kph, Race Av.Speed Is 148kph
R=under lap record by greatest margin, r=under lap record, *=fastest lap time
________________________________________________________________________________________________________________________________
Issue# 2 - Printed Sat May 26 15:43:31 2012 Timing System By NATSOFT (03)63431311 www.natsoft.com.au/results
Amended
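I need to parse it into objects with clearly named fields such as position, car, driver, and so on. The problem is that I don't know what strategy to use. If I split it on whitespace, I end up with a list like this: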
["1", "6", "Jason", "Clements", "Jason", "Clements", "BMW", "M3", "3200", "10", "9:48.5710", "3", "0:57.3228*"]
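You can see the problem. I can't simply interpret this list by index, because a person might have only one name, or three words in their name, and the vehicle can be any number of words. That makes it impossible to refer to the list by index alone.
How can I use the offsets defined by the column names? I'm not quite sure how to go about it.
Edit: the algorithm I'm currently using works as follows:
There are a few problems:
If the names all have the same length, for example: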
Jason Adams
Bobby Sacka
Jerry Louis
then it interprets each name as two separate items (["Jason", "Adams", "Bobby", "Sacka", "Jerry", "Louis"]).
However, if they all differ in length:
Dominic Bou
Bob Adams
Jerry Seinfeld
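then it splits correctly after the final 'd' of Seinfeld (so we get a set of three names: ["Dominic Bou", "Bob Adams", "Jerry Seinfeld"]).
It is also brittle. I'm looking for a better solution.
Answer 0 (score: 6)
This is not a good case for a regex. You really want to discover the format and then unpack the lines: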
# str is assumed to hold the whole results text
lines = str.split "\n"
# you know the field names, so you can use them to find the column positions
fields = ['Pos', 'Car', 'Competitor/Team', 'Driver', 'Vehicle', 'Cap', 'CL Laps', 'Race.Time', 'Fastest...Lap']
header = lines.shift until header =~ /^Pos/
positions = fields.map{|f| header.index f}
# use that to construct an unpack format string;
# the trailing A* lets the last field take the rest of the line
format = 1.upto(positions.length-1).map{|x| "A#{positions[x] - positions[x-1]}"}.join + 'A*'
# A4A5A31A25A21A6A12A10A*
lines.each do |line|
  next unless line =~ /^(\d|DNF)/ # skip lines you're not interested in
  data = line.unpack(format).map{|x| x.strip}
  puts data.join(', ')
  # or better yet...
  car = Hash[fields.zip data]
  puts car['Driver']
end
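Answer 1 (score: 6)
http://blog.ryanwood.com/past/2009/6/12/slither-a-dsl-for-parsing-fixed-width-text-files might solve your problem.
There are more examples there, and the project is on GitHub.
Hope this helps!
Answer 2 (score: 5)
I think this is easy to do by using fixed widths on each line.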
#!/usr/bin/env ruby
# Usage: ruby parsing_winner.rb winners_list.txt
args = ARGV
# stop here if no file was given, instead of failing on open(nil)
abort "Usage: ruby parsing_winner.rb winners_list.txt" if args.empty?
winner_file = File.open(args.shift)
array_of_race_results, array_of_race_results_array = [], []

class RaceResult
  attr_accessor :position, :car, :team, :driver, :vehicle, :cap, :cl_laps, :race_time, :fastest, :fastest_lap

  def initialize(position, car, team, driver, vehicle, cap, cl_laps, race_time, fastest, fastest_lap)
    @position = position
    @car = car
    @team = team
    @driver = driver
    @vehicle = vehicle
    @cap = cap
    @cl_laps = cl_laps
    @race_time = race_time
    @fastest = fastest
    @fastest_lap = fastest_lap
  end

  def to_a
    # e.g. ["1", "6", "Jason Clements", "Jason Clements", "BMW M3", "3200", "10", "9:48.5710", "3", "0:57.3228*"]
    [position, car, team, driver, vehicle, cap, cl_laps, race_time, fastest, fastest_lap]
  end
end

# Pos Car Competitor/Team Driver Vehicle Cap CL Laps Race.Time Fastest...Lap
# 1 6 Jason Clements Jason Clements BMW M3 3200 10 9:48.5710 3 0:57.3228*
# 2 42 David Skillender David Skillender Holden VS Commodore 6000 10 9:55.6866 2 0:57.9409
# etc...
winner_file.each_line do |line|
  # skip rulers, the header, footer lines and blank lines
  next if line[/^____/] || line[/^\w{4,}|^\s|^Pos/] || line[0..3][/\=/]
  # slice each field by its fixed character range and trim the padding
  position = line[0..3].strip
  car = line[4..8].strip
  team = line[9..39].strip
  driver = line[40..64].strip
  vehicle = line[65..85].strip
  cap = line[86..91].strip
  cl_laps = line[92..101].strip
  race_time = line[102..113].strip
  fastest = line[114..116].strip
  fastest_lap = line[117..-1].strip
  racer = RaceResult.new(position, car, team, driver, vehicle, cap, cl_laps, race_time, fastest, fastest_lap)
  array_of_race_results << racer
  array_of_race_results_array << racer.to_a
end
puts "Race Results Objects: #{array_of_race_results}"
puts "Race Results: #{array_of_race_results_array.inspect}"
Output =>
Race Results Objects: [#<RaceResult:0x007fcc4a84b7c8 @position="1", @car="6", @team="Jason Clements", @driver="Jason Clements", @vehicle="BMW M3", @cap="3200", @cl_laps="10", @race_time="9:48.5710", @fastest="3", @fastest_lap="0:57.3228*">, #<RaceResult:0x007fcc4a84aa08 @position="2", @car="42", @team="David Skillender", @driver="David Skillender", @vehicle="Holden VS Commodore", @cap="6000", @cl_laps="10", @race_time="9:55.6866", @fastest="2", @fastest_lap="0:57.9409">, #<RaceResult:0x007fcc4a849ce8 @position="3", @car="37", @team="Bruce Cook", @driver="Bruce Cook", @vehicle="Ford Escort", @cap="3759", @cl_laps="10", @race_time="9:56.4388", @fastest="4", @fastest_lap="0:58.3359">, #<RaceResult:0x007fcc4a8491f8 @position="4", @car="18", @team="Troy Marinelli", @driver="Troy Marinelli", @vehicle="Nissan Silvia", @cap="3396", @cl_laps="10", @race_time="9:56.7758", @fastest="2", @fastest_lap="0:58.4443">, #<RaceResult:0x007fcc4b091ab8 @position="5", @car="75", @team="Anthony Gilbertson", @driver="Anthony Gilbertson", @vehicle="BMW M3", @cap="3200", @cl_laps="10", @race_time="10:02.5842", @fastest="3", @fastest_lap="0:58.9336">, #<RaceResult:0x007fcc4b0916a8 @position="6", @car="26", @team="Trent Purcell", @driver="Trent Purcell", @vehicle="Mazda RX7", @cap="2354", @cl_laps="10", @race_time="10:07.6285", @fastest="4", @fastest_lap="0:59.0546">, #<RaceResult:0x007fcc4b091298 @position="7", @car="12", @team="Scott Hunter", @driver="Scott Hunter", @vehicle="Toyota Corolla", @cap="2000", @cl_laps="10", @race_time="10:11.3722", @fastest="5", @fastest_lap="0:59.8921">, #<RaceResult:0x007fcc4b090e88 @position="8", @car="91", @team="Graeme Wilkinson", @driver="Graeme Wilkinson", @vehicle="Ford Escort", @cap="2000", @cl_laps="10", @race_time="10:13.4114", @fastest="5", @fastest_lap="1:00.2175">, #<RaceResult:0x007fcc4b090a78 @position="9", @car="7", @team="Justin Wade", @driver="Justin Wade", @vehicle="BMW M3", @cap="4000", @cl_laps="10", @race_time="10:18.2020", @fastest="9", 
@fastest_lap="1:00.8969">, #<RaceResult:0x007fcc4b090668 @position="10", @car="55", @team="Greg Craig", @driver="Grag Craig", @vehicle="Toyota Corolla", @cap="1840", @cl_laps="10", @race_time="10:18.9956", @fastest="7", @fastest_lap="1:00.7905">, #<RaceResult:0x007fcc4b090258 @position="11", @car="46", @team="Kyle Orgam-Moore", @driver="Kyle Organ-Moore", @vehicle="Holden VS Commodore", @cap="6000", @cl_laps="10", @race_time="10:30.0179", @fastest="3", @fastest_lap="1:01.6741">, #<RaceResult:0x007fcc4b08fe48 @position="12", @car="39", @team="Uptiles Strathpine", @driver="Trent Spencer", @vehicle="BMW Mini Cooper S", @cap="1500", @cl_laps="10", @race_time="10:40.1436", @fastest="2", @fastest_lap="1:02.2728">, #<RaceResult:0x007fcc4b08fa38 @position="13", @car="177", @team="Mark Hyde", @driver="Mark Hyde", @vehicle="Ford Escort", @cap="1993", @cl_laps="10", @race_time="10:49.5920", @fastest="2", @fastest_lap="1:03.8069">, #<RaceResult:0x007fcc4b08f628 @position="14", @car="34", @team="Peter Draheim", @driver="Peter Draheim", @vehicle="Mazda RX3", @cap="2600", @cl_laps="10", @race_time="10:50.8159", @fastest="10", @fastest_lap="1:03.4396">, #<RaceResult:0x007fcc4b08f218 @position="15", @car="5", @team="Scott Douglas", @driver="Scott Douglas", @vehicle="Datsun 1200", @cap="1998", @cl_laps="9", @race_time="9:48.7808", @fastest="3", @fastest_lap="1:01.5371">, #<RaceResult:0x007fcc4b08ee08 @position="16", @car="72", @team="Paul Redman", @driver="Paul Redman", @vehicle="Ford Focus", @cap="2lt", @cl_laps="9", @race_time="10:11.3707", @fastest="2", @fastest_lap="1:05.8729">, #<RaceResult:0x007fcc4b08e9f8 @position="17", @car="8", @team="Matthew Speakman", @driver="Matthew Speakman", @vehicle="Toyota Celica", @cap="1600", @cl_laps="9", @race_time="10:16.3159", @fastest="3", @fastest_lap="1:05.9117">, #<RaceResult:0x007fcc4b08e5e8 @position="18", @car="74", @team="Lucas Easton", @driver="Lucas Easton", @vehicle="Toyota Celica", @cap="1600", @cl_laps="9", 
@race_time="10:16.8050", @fastest="6", @fastest_lap="1:06.0748">, #<RaceResult:0x007fcc4b08e1d8 @position="19", @car="77", @team="Dean Fuller", @driver="Dean Fuller", @vehicle="Mitsubishi Sigma", @cap="2600", @cl_laps="9", @race_time="10:25.2877", @fastest="3", @fastest_lap="1:07.3991">, #<RaceResult:0x007fcc4b08ddc8 @position="20", @car="16", @team="Brett Batterby", @driver="Brett Batterby", @vehicle="Toyota Corolla", @cap="1600", @cl_laps="9", @race_time="10:29.9127", @fastest="4", @fastest_lap="1:07.8420">, #<RaceResult:0x007fcc4a848348 @position="21", @car="95", @team="Ross Hurford", @driver="Ross Hurford", @vehicle="Toyota Corolla", @cap="1600", @cl_laps="8", @race_time="9:57.5297", @fastest="2", @fastest_lap="1:12.2672">, #<RaceResult:0x007fcc4a847948 @position="DNF", @car="13", @team="Charles Wright", @driver="Charles Wright", @vehicle="BMW 325i", @cap="2700", @cl_laps="9", @race_time="9:47.9888", @fastest="7", @fastest_lap="1:03.2808">, #<RaceResult:0x007fcc4a847010 @position="DNF", @car="20", @team="Shane Satchwell", @driver="Shane Satchwell", @vehicle="Datsun 1200 Coupe", @cap="1998", @cl_laps="1", @race_time="1:05.9100", @fastest="1", @fastest_lap="1:05.9100">]
Race Results: [["1", "6", "Jason Clements", "Jason Clements", "BMW M3", "3200", "10", "9:48.5710", "3", "0:57.3228*"], ["2", "42", "David Skillender", "David Skillender", "Holden VS Commodore", "6000", "10", "9:55.6866", "2", "0:57.9409"], ["3", "37", "Bruce Cook", "Bruce Cook", "Ford Escort", "3759", "10", "9:56.4388", "4", "0:58.3359"], ["4", "18", "Troy Marinelli", "Troy Marinelli", "Nissan Silvia", "3396", "10", "9:56.7758", "2", "0:58.4443"], ["5", "75", "Anthony Gilbertson", "Anthony Gilbertson", "BMW M3", "3200", "10", "10:02.5842", "3", "0:58.9336"], ["6", "26", "Trent Purcell", "Trent Purcell", "Mazda RX7", "2354", "10", "10:07.6285", "4", "0:59.0546"], ["7", "12", "Scott Hunter", "Scott Hunter", "Toyota Corolla", "2000", "10", "10:11.3722", "5", "0:59.8921"], ["8", "91", "Graeme Wilkinson", "Graeme Wilkinson", "Ford Escort", "2000", "10", "10:13.4114", "5", "1:00.2175"], ["9", "7", "Justin Wade", "Justin Wade", "BMW M3", "4000", "10", "10:18.2020", "9", "1:00.8969"], ["10", "55", "Greg Craig", "Grag Craig", "Toyota Corolla", "1840", "10", "10:18.9956", "7", "1:00.7905"], ["11", "46", "Kyle Orgam-Moore", "Kyle Organ-Moore", "Holden VS Commodore", "6000", "10", "10:30.0179", "3", "1:01.6741"], ["12", "39", "Uptiles Strathpine", "Trent Spencer", "BMW Mini Cooper S", "1500", "10", "10:40.1436", "2", "1:02.2728"], ["13", "177", "Mark Hyde", "Mark Hyde", "Ford Escort", "1993", "10", "10:49.5920", "2", "1:03.8069"], ["14", "34", "Peter Draheim", "Peter Draheim", "Mazda RX3", "2600", "10", "10:50.8159", "10", "1:03.4396"], ["15", "5", "Scott Douglas", "Scott Douglas", "Datsun 1200", "1998", "9", "9:48.7808", "3", "1:01.5371"], ["16", "72", "Paul Redman", "Paul Redman", "Ford Focus", "2lt", "9", "10:11.3707", "2", "1:05.8729"], ["17", "8", "Matthew Speakman", "Matthew Speakman", "Toyota Celica", "1600", "9", "10:16.3159", "3", "1:05.9117"], ["18", "74", "Lucas Easton", "Lucas Easton", "Toyota Celica", "1600", "9", "10:16.8050", "6", "1:06.0748"], ["19", "77", 
"Dean Fuller", "Dean Fuller", "Mitsubishi Sigma", "2600", "9", "10:25.2877", "3", "1:07.3991"], ["20", "16", "Brett Batterby", "Brett Batterby", "Toyota Corolla", "1600", "9", "10:29.9127", "4", "1:07.8420"], ["21", "95", "Ross Hurford", "Ross Hurford", "Toyota Corolla", "1600", "8", "9:57.5297", "2", "1:12.2672"], ["DNF", "13", "Charles Wright", "Charles Wright", "BMW 325i", "2700", "9", "9:47.9888", "7", "1:03.2808"], ["DNF", "20", "Shane Satchwell", "Shane Satchwell", "Datsun 1200 Coupe", "1998", "1", "1:05.9100", "1", "1:05.9100"]]
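Answer 3 (score: 4)
Depending on how consistent the format is, you might be able to use a regex.
Here is a sample regex that works for the current data. It may need adjusting to the exact rules, but it gives the idea: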
^
# Pos
(\d+|DNF)
\s+
#Car
(\d+)
\s+
# Team
([\w-]+(?: [\w-]+)+)
\s+
# Driver
([\w-]+(?: [\w-]+)+)
\s+
# Vehicle
([\w-]+(?: ?[\w-]+)+)
\s+
# Cap
(\d{4}|\dlt)
\s+
# CL Laps
(\d+)
\s+
# Race.Time
(\d+:\d+\.\d+)
\s+
# Fastest Lap
(\d+)
\s+
# Fastest Lap Time
(\d+:\d+\.\d+\*?)
\s*
$
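A sketch of how a pattern like this might be used from Ruby. In Ruby's extended (`/x`) mode, whitespace in the pattern is ignored, so the literal spaces inside names are written as `[ ]` here. The group names and the `parse_row` helper are made up for illustration, and the sketch assumes the real file separates columns with at least two spaces.

```ruby
# Assumption: columns are separated by two or more spaces,
# words inside one column by exactly one space.
ROW = /
  ^
  (?<pos>\d+|DNF)                       \s+
  (?<car>\d+)                           \s+
  (?<team>[\w-]+(?:[ ][\w-]+)+)         \s+
  (?<driver>[\w-]+(?:[ ][\w-]+)+)       \s+
  (?<vehicle>[\w-]+(?:[ ]?[\w-]+)+)     \s+
  (?<cap>\d{4}|\dlt)                    \s+
  (?<cl_laps>\d+)                       \s+
  (?<race_time>\d+:\d+\.\d+)            \s+
  (?<fastest_lap_no>\d+)                \s+
  (?<fastest_lap_time>\d+:\d+\.\d+\*?)  \s*
  $
/x

# Returns a hash of field name => text, or nil for lines that are not result rows.
def parse_row(line)
  match = ROW.match(line)
  match && match.named_captures
end

# Sample row rebuilt with three spaces between columns (an assumption
# about the original spacing).
line = ['DNF', '20', 'Shane Satchwell', 'Shane Satchwell', 'Datsun 1200 Coupe',
        '1998', '1', '1:05.9100', '1', '1:05.9100'].join('   ')
row = parse_row(line)
puts row['vehicle']
puts row['cap']
```

Lines such as the header or the footer do not match and come back as `nil`, so they can be skipped with a simple `next unless row`.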
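Answer 4 (score: 4)
If you can verify that the whitespace consists of space characters rather than tabs, and that overly long text is always truncated to fit the column structure, then I would hard-code the slice boundaries: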
parsed = [rawLine[0:3],rawLine[4:7],rawLine[9:38], ...etc... ]
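Depending on the data source, this can be brittle (for example, if the column widths differ from run to run).
If the header line is always the same, you can extract the slice boundaries by searching the header line for its known words.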
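Deriving the boundaries from the header could look roughly like the following Ruby sketch. `column_ranges` and `slice_row` are hypothetical helper names, and the sample header and row are made up to mimic the layout. Note that `String#index` finds the first occurrence, so this only works when no column name also appears earlier in the header as part of another word.

```ruby
# Hypothetical helper: one slice range per column, taken from where each
# known column name starts in the header line.
def column_ranges(header, names)
  starts = names.map do |name|
    header.index(name) || raise(ArgumentError, "column #{name.inspect} not found")
  end
  # Each column runs up to the start of the next; the last runs to end of line.
  starts.each_with_index.map do |start, i|
    i + 1 < starts.length ? (start...starts[i + 1]) : (start..-1)
  end
end

# Cut one data line with those ranges and trim the padding.
def slice_row(line, ranges)
  ranges.map { |range| (line[range] || '').strip }
end

# Made-up sample: format pads each value to a fixed width.
layout = '%-4s%-5s%-16s%s'
header = format(layout, 'Pos', 'Car', 'Driver', 'Vehicle')
data   = format(layout, '1', '6', 'Jason Clements', 'BMW M3')

ranges = column_ranges(header, %w[Pos Car Driver Vehicle])
p slice_row(data, ranges)
```

Because the ranges are recomputed from each file's own header, a run with different column widths is handled without touching the code.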
Answer 5 (score: 4)
You can use the fixed_width gem.
You can parse your given file with the following code:
require 'fixed_width'
require 'pp'
FixedWidth.define :cars do |d|
  d.head do |head|
    head.trap { |line| line !~ /\d/ }
  end
  d.body do |body|
    body.trap { |line| line =~ /^(\d|DNF)/ }
    body.column :pos, 4
    body.column :car, 5
    body.column :competitor, 31
    body.column :driver, 25
    body.column :vehicle, 21
    body.column :cap, 5
    body.column :cl_laps, 11
    body.column :race_time, 11
    body.column :fast_lap_no, 4
    body.column :fast_lap_time, 10
  end
end
pp FixedWidth.parse(File.open("races.txt"), :cars)
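The trap method identifies the lines that belong to each section. I used regexes: the head regex looks for lines that do not contain a digit, and the body regex looks for lines that start with a digit or "DNF". Each section must include the line immediately following the last one. The column definitions simply state how many character columns to grab. The library strips the whitespace for you. If you wanted to generate a fixed-width file, you could add alignment parameters, but it looks like you don't need that.
The result is a hash that starts like this: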
{:head=>[{}, {}, {}],
:body=>
[{:pos=>"1",
:car=>"6",
:competitor=>"Jason Clements",
:driver=>"Jason Clements",
:vehicle=>"BMW M3",
:cap=>"3200",
:cl_laps=>"10",
:race_time=>"9:48.5710",
:fast_lap_no=>"3",
:fast_lap_time=>"0:57.3228"},
{:pos=>"2",
:car=>"42",
:competitor=>"David Skillender",
:driver=>"David Skillender",
:vehicle=>"Holden VS Commodore",
:cap=>"6000",
:cl_laps=>"10",
:race_time=>"9:55.6866",
:fast_lap_no=>"2",
:fast_lap_time=>"0:57.9409"},
Answer 6 (score: 4)
OK, I've got it:
Edit: I forgot to mention that this assumes you have stored the input text in the variable input_string.
# Choose a delimiter that is unlikely to occur in the data
DELIM = '|||'
# DRY -> extend String
class String
  # split wherever at least min_spaces whitespace characters occur in a row
  def split_on_spaces(min_spaces = 1)
    self.strip.gsub(/\s{#{min_spaces},}/, DELIM).split(DELIM)
  end
end
# just get the data lines
lines = input_string.split("\n")
lines = lines[2...(lines.length - 4)].delete_if { |line|
  line.empty?
}
# Grab all the entries into a nice 2-d array
entries = lines.map { |line|
  [
    line[0..8].split_on_spaces,
    line[9..85].split_on_spaces(3).map{ |string|
      string.gsub(/\s+/, ' ') # replace whitespace with 1 space
    },
    line[85...line.length].split_on_spaces(2)
  ].flatten
}
# BONUS
# Make nice hashes
keys = [:pos, :car, :team, :driver, :vehicle, :cap, :cl_laps, :race_time, :fastest_lap]
objects = entries.map { |entry|
  Hash[keys.zip entry]
}
Output:
entries # =>
["1", "6", "Jason Clements", "Jason Clements", "BMW M3", "3200", "10", "9:48.5710", "3 0:57.3228*"]
["2", "42", "David Skillender", "David Skillender", "Holden VS Commodore", "6000", "10", "9:55.6866", "2 0:57.9409"]
...
# all of length 9, no extra spaces
If arrays just don't cut it:
objects # =>
{:pos=>"1", :car=>"6", :team=>"Jason Clements", :driver=>"Jason Clements", :vehicle=>"BMW M3", :cap=>"3200", :cl_laps=>"10", :race_time=>"9:48.5710", :fastest_lap=>"3 0:57.3228*"}
{:pos=>"2", :car=>"42", :team=>"David Skillender", :driver=>"David Skillender", :vehicle=>"Holden VS Commodore", :cap=>"6000", :cl_laps=>"10", :race_time=>"9:55.6866", :fastest_lap=>"2 0:57.9409"}
...
I'll leave refactoring it into a function to you.
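Answer 7 (score: 3)
Unless there is a clear rule for how the columns are separated, you cannot really do this.
The approach you are taking is fine, assuming you know that every column value is correctly aligned under its column header.
Another approach could be to group together words that are separated by only a single space (from the text you provided, I can see that this rule holds as well).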
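The single-space grouping rule can be sketched in Ruby by splitting on runs of two or more whitespace characters. `split_columns` is an illustrative name, and the sample row is rebuilt with three spaces between columns, which is an assumption about the original spacing.

```ruby
# Assumption: columns are separated by at least two spaces, and no value
# contains two spaces in a row.
def split_columns(line)
  line.strip.split(/\s{2,}/)
end

row = ['1', '6', 'Jason Clements', 'Jason Clements', 'BMW M3', '3200'].join('   ')
p split_columns(row)
# prints ["1", "6", "Jason Clements", "Jason Clements", "BMW M3", "3200"]
```

The rule breaks down if a value fills its column so completely that only one space separates it from the next column, or if a column is empty, since the remaining fields then shift left.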
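Answer 8 (score: 2)
Assuming the spacing of the text is always the same, you can split the string by position and then strip the extra whitespace around each part. For example, in Python: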
pos=row[0:3].strip()
car=row[4:7].strip()
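and so on. Alternatively, you could define a regex that captures each part:
([[:alnum:]]+)\s([[:digit:]]+)\s(([[:alpha:]]+ )+)\s(([[:alpha:]]+ )+)\s(([[:alpha:]]* )+)\s
and so on. (The exact syntax depends on your regex flavor.) Note that the vehicle regex needs to handle the additional spaces.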
Answer 9 (score: 1)
I won't code this up, but one approach that works for the data set above is to split on whitespace and then assign the elements like this:
someArray = array of strings that were split by white space
Pos = someArray[0]
Car = someArray[1]
Competitor/Team = someArray[2] + " " + someArray[3]
Driver = someArray[4] + " " + someArray[5]
Vehicle = someArray[6] + " " + ... + " " + someArray[someArray.length - 6]
Cap = someArray[someArray.length - 5]
CL Laps = someArray[someArray.length - 4]
Race.Time = someArray[someArray.length - 3]
Fastest...Lap = someArray[someArray.length - 2] + " " + someArray[someArray.length - 1]
The vehicle part can be built with some kind of for or while loop.
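In Ruby, the assignment above might be sketched like this, with a range slice taking the place of the loop. `assign_fields` and the hash keys are illustrative names. The sketch keeps the pseudocode's assumption that team and driver are always exactly two words each, which the question warns is not guaranteed in general.

```ruby
# Assumption: team and driver are two words each; the vehicle is whatever
# sits between them and the last five tokens.
def assign_fields(tokens)
  {
    pos: tokens[0],
    car: tokens[1],
    team: tokens[2..3].join(' '),
    driver: tokens[4..5].join(' '),
    vehicle: tokens[6..-6].join(' '),      # everything up to the Cap token
    cap: tokens[-5],
    cl_laps: tokens[-4],
    race_time: tokens[-3],
    fastest_lap: tokens[-2..-1].join(' ')  # lap number plus lap time
  }
end

tokens = '12 39 Uptiles Strathpine Trent Spencer BMW Mini Cooper S 1500 10 10:40.1436 2 1:02.2728'.split
result = assign_fields(tokens)
puts result[:vehicle]
puts result[:fastest_lap]
```

Counting from both ends is what lets the vehicle vary in length, but it only works while exactly one field per row has a variable word count.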