Extracting data from a dataset with a regular expression

Date: 2011-02-14 20:57:52

Tags: ruby regex

I have this dataset:

LP3I22- M5
01174c-qbFD.raw
L2P2 + p LPI Full ms [150.00-1500.00]
Scan #: 1
RT: 6.11
m/z Intensity   Relative    Resolution  Charge  Baseline

  150.0119         67.3     0.00    152545.44       0.00       26.27
  150.0153         59.3     0.00    269991.72       0.00       26.28
  150.0156         66.1     0.00    288504.16       0.00       26.28
  150.0161         67.2     0.00    172425.14       0.00       26.28
  150.0330         78.9     0.00    167957.34       0.00       26.32
  150.0485         75.0     0.00    208783.14       0.00       26.35
  150.0603        166.2     0.00    220081.53       0.00       26.37
  150.0624         75.8     0.00    189976.39       0.00       26.38
  150.0866         70.1     0.00    233127.77       0.00       26.42
  150.0991         54.8     0.00    193755.25       0.00       26.45
  150.1136         62.9     0.00    184047.91       0.00       26.48
  150.1348         85.4     0.00    206299.06       0.00       26.52
  150.1410         68.7     0.00    225439.47       0.00       26.53
  150.1428         73.1     0.00    205324.42       0.00       26.54
  150.1498         61.2     0.00    199792.59       0.00       26.55
  150.1572         56.8     0.00    160342.95       0.00       26.57
  150.1583         71.4     0.00    187849.53       0.00       26.57
  150.1746         84.7     0.00    211934.81       0.00       26.60
  150.1777         81.2     0.00    251123.45       0.00       26.61
  150.2106         65.7     0.00    198830.13       0.00       26.67
  150.2144         53.7     0.00    190111.53       0.00       26.68
  150.2781         74.0     0.00    187803.52       0.00       26.81
  150.2807         90.7     0.00    174743.38       0.00       26.82

How can I use a regular expression to extract the data rows? I am not interested in the first 7 lines.

4 answers:

Answer 0 (score: 6)

Assuming the text is in a string named data:

number_re = /\s*(\d+\.\d+)\s*/
data.scan(/^#{number_re.source * 6}$/)

This produces the following array:

[["150.0119", "67.3", "0.00", "152545.44", "0.00", "26.27"],
 ["150.0153", "59.3", "0.00", "269991.72", "0.00", "26.28"],
 ["150.0156", "66.1", "0.00", "288504.16", "0.00", "26.28"],
 ["150.0161", "67.2", "0.00", "172425.14", "0.00", "26.28"],
 ["150.0330", "78.9", "0.00", "167957.34", "0.00", "26.32"],
 ["150.0485", "75.0", "0.00", "208783.14", "0.00", "26.35"],
 ["150.0603", "166.2", "0.00", "220081.53", "0.00", "26.37"],
 ["150.0624", "75.8", "0.00", "189976.39", "0.00", "26.38"],
 ["150.0866", "70.1", "0.00", "233127.77", "0.00", "26.42"],
 ["150.0991", "54.8", "0.00", "193755.25", "0.00", "26.45"],
 ["150.1136", "62.9", "0.00", "184047.91", "0.00", "26.48"],
 ["150.1348", "85.4", "0.00", "206299.06", "0.00", "26.52"],
 ["150.1410", "68.7", "0.00", "225439.47", "0.00", "26.53"],
 ["150.1428", "73.1", "0.00", "205324.42", "0.00", "26.54"],
 ["150.1498", "61.2", "0.00", "199792.59", "0.00", "26.55"],
 ["150.1572", "56.8", "0.00", "160342.95", "0.00", "26.57"],
 ["150.1583", "71.4", "0.00", "187849.53", "0.00", "26.57"],
 ["150.1746", "84.7", "0.00", "211934.81", "0.00", "26.60"],
 ["150.1777", "81.2", "0.00", "251123.45", "0.00", "26.61"],
 ["150.2106", "65.7", "0.00", "198830.13", "0.00", "26.67"],
 ["150.2144", "53.7", "0.00", "190111.53", "0.00", "26.68"],
 ["150.2781", "74.0", "0.00", "187803.52", "0.00", "26.81"],
 ["150.2807", "90.7", "0.00", "174743.38", "0.00", "26.82"]]
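A minimal sketch of how the pattern above is assembled, using a shortened, hypothetical sample in place of the full dataset. It shows that repeating number_re.source six times yields one anchored row pattern, and adds a conversion of the captured strings to floats, which is not part of the original answer:

```ruby
# Shortened sample (assumption): two header lines, a blank line, two data rows.
data = <<~TEXT
  Scan #: 1
  RT: 6.11

  150.0119         67.3     0.00    152545.44       0.00       26.27
  150.0153         59.3     0.00    269991.72       0.00       26.28
TEXT

number_re = /\s*(\d+\.\d+)\s*/
# Six copies of the number pattern, anchored to a single line.
row_re = /^#{number_re.source * 6}$/

rows = data.scan(row_re)                     # one array of six strings per data row
floats = rows.map { |row| row.map(&:to_f) }  # the same rows as Floats
```

Header lines are skipped automatically because they do not consist of six decimal numbers, so no line counting is needed.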

Answer 1 (score: 3)

lines = IO.readlines('inputfile.txt')
data = lines[7..-1].collect{|x| x.scan(/([^\d]+[\d.]+)/).flatten.map{|y| y.strip}}

For a simpler solution that does not involve a regular expression, replace the last line with:

data = lines[7..-1].collect{|x| x.split}

This all assumes that the dataset matches the one you listed and contains no unexpected or malformed values.
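If that assumption might not hold, one possible approach is to validate each line before accepting it. The sketch below is not part of the original answer; parse_row is a hypothetical helper and the sample lines are made up for illustration:

```ruby
# Hypothetical helper: returns six Floats, or nil when the line is not
# a well-formed data row (wrong field count or a non-numeric field).
def parse_row(line)
  fields = line.split
  return nil unless fields.size == 6
  fields.map { |f| Float(f) }   # Float() raises ArgumentError on bad input
rescue ArgumentError
  nil
end

# Made-up sample: a header, a blank line, one good row, one malformed row.
lines = [
  "m/z Intensity   Relative    Resolution  Charge  Baseline\n",
  "\n",
  "  150.0119         67.3     0.00    152545.44       0.00       26.27\n",
  "  150.0153         n/a      0.00    269991.72       0.00       26.28\n"
]

data = lines.map { |line| parse_row(line) }.compact
```

Float() is used instead of to_f because to_f silently turns unparseable text into 0.0, which would hide malformed values.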

Answer 2 (score: 1)

Use the pattern:

^\s*(\d+\.\d+)\s*(\d+\.\d+)\s*(\d+\.\d+)\s*(\d+\.\d+)\s*(\d+\.\d+)\s*(\d+\.\d+)\s*$

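in multiline mode, so that ^ and $ match at the start and end of each line. In Ruby, ^ and $ already behave this way without any flag.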
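A short sketch of this pattern in Ruby, on a shortened sample string (an assumption standing in for the full dataset). No flag is passed, because Ruby's ^ and $ are line anchors by default; the /m flag in Ruby only changes whether . matches a newline:

```ruby
row_re = /^\s*(\d+\.\d+)\s*(\d+\.\d+)\s*(\d+\.\d+)\s*(\d+\.\d+)\s*(\d+\.\d+)\s*(\d+\.\d+)\s*$/

# Shortened sample (assumption): one header line followed by two data rows.
text = "RT: 6.11\n" \
       "  150.0119  67.3  0.00  152545.44  0.00  26.27\n" \
       "  150.0153  59.3  0.00  269991.72  0.00  26.28\n"

matches = text.scan(row_re)          # one array of six captures per data row

# ^ and $ anchor to line boundaries inside a multi-line string:
line_anchor_index = "a\nb" =~ /^b$/  # index of the match on the second line
```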

Answer 3 (score: 1)

7.times { DATA.readline }  # discard the first 7 lines (6 header lines plus the blank line)
# squeeze(' ') collapses only runs of spaces; a bare squeeze would also merge repeated digits
res = DATA.map { |line| line.lstrip.squeeze(' ').split(' ').map { |el| el.to_f } }

__END__
LP3I22- M5
01174c-qbFD.raw
L2P2 + p LPI Full ms [150.00-1500.00]
Scan #: 1
RT: 6.11
m/z Intensity   Relative    Resolution  Charge  Baseline

  150.0119         67.3     0.00    152545.44       0.00       26.27
  150.0153         59.3     0.00    269991.72       0.00       26.28
  150.0156         66.1     0.00    288504.16       0.00       26.28
  150.0161         67.2     0.00    172425.14       0.00       26.28
  150.0330         78.9     0.00    167957.34       0.00       26.32
  150.0485         75.0     0.00    208783.14       0.00       26.35
  150.0603        166.2     0.00    220081.53       0.00       26.37

The values in res are now floats:

 [[150.0119, 67.3, 0.0, 152545.44, 0.0, 26.27], [150.0153, 59.3, 0.0, 269991.72, 0.0, 26.28],
 [150.0156, 66.1, 0.0, 288504.16, 0.0, 26.28], [150.0161, 67.2, 0.0, 172425.14, 0.0, 26.28],
 [150.033, 78.9, 0.0, 167957.34, 0.0, 26.32], [150.0485, 75.0, 0.0, 208783.14, 0.0, 26.35],
 [150.0603, 166.2, 0.0, 220081.53, 0.0, 26.37]]
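
Once the rows are floats, it may be convenient to label each column. The following is only a sketch: the key names are hypothetical, chosen to mirror the header line of the dataset, and res here holds just the first two rows for brevity:

```ruby
# Hypothetical key names mirroring the dataset's header line.
columns = %i[mz intensity relative resolution charge baseline]

# First two parsed rows, standing in for the full result.
res = [
  [150.0119, 67.3, 0.0, 152545.44, 0.0, 26.27],
  [150.0153, 59.3, 0.0, 269991.72, 0.0, 26.28]
]

# Pair each value with its column name to get one Hash per row.
records = res.map { |row| columns.zip(row).to_h }

# Example query: the row with the highest intensity.
strongest = records.max_by { |r| r[:intensity] }
```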