我认为应该很容易解决一个小的生物信息学问题。与“基因型定相”有关。但我不确定如何解决它。 在下面的摘录中,第一列是标识符,后续列是标有“a”或“b”的二进制基因型。 “ - ”表示缺失值。
Si_gnF.scaffold10533.53688bp_tag414456 b a a b b a b a a a b a b b a b a a b b a a b b
Si_gnF.scaffold10533.76297bp_tag414484 a b b a a b a b b b a b a a b a b b a - b b a a
Si_gnF.scaffold10533.98416bp_tag414526 a b b a a b a b b b a b a a b a b b a a b b a a
Si_gnF.scaffold10534.48805bp_tag414546 b a a b a b a b b b b b b a a a a b a b b b b a
Si_gnF.scaffold10535.1091787bp_tag414684 a a a b b a a a b a b a a a a b b b a a b b a a
Si_gnF.scaffold10535.1151107bp_tag414765 b b b a a b b b a b a - b b b a a a b b a a b b
Si_gnF.scaffold10535.1220879bp_tag414877 a a a b b a a a b a b a a a a b b b a a b b a a
Si_gnF.scaffold10535.1304464bp_tag414988 b b b a a b b b a b a b b b b a a a b b a a b b
Si_gnF.scaffold10535.1347462bp_tag415047 b b b a a b b b a b a b b b b a a a b b a a b b
Si_gnF.scaffold10535.1379804bp_tag415090 b b b a a b b b a b a b b b b a a a b b a a b b
Si_gnF.scaffold10535.1540335bp_tag415345 a a a b b a a a b a b a a a a b b b a a b b a a
Si_gnF.scaffold10535.1585442bp_tag415410 a a a b b a a a b a b a a a a b b b a a b b a a
Si_gnF.scaffold10535.1609908bp_tag415431 b b b a a b a b a b a b b b b a a a b b a a b b
Si_gnF.scaffold10535.1711158bp_tag415567 b b b a a b b b a b a b b b b a a a b b a a b b
Si_gnF.scaffold10535.1744394bp_tag415609 b b b a a b b b a b a b b b b a a a b b a a b b
Si_gnF.scaffold10535.1751886bp_tag415620 a a a b b a a a b a b a a a a b b b a a b b a a
Si_gnF.scaffold10535.1752774bp_tag415622 a a a b b a a a b a b a a a a b b b a a b b a a
Si_gnF.scaffold10535.1789478bp_tag415675 b b - a a b b b a b a b b b b a a a b b a a b b
Si_gnF.scaffold10535.1800135bp_tag415687 b b b a a b b b a b a b b b b a a a b b a a b b
Si_gnF.scaffold10535.1885424bp_tag415814 a a a b b a a a b a b a a a a b b b a a b b a a
基本上,我想尽量减少行之间的差异数量。 (我无法编辑单个列,但可以在整行上翻转标签)。前四行的结果是:
Si_gnF.scaffold10533.53688bp_tag414456 b a a b b a b a a a b a b b a b a a b b a a b b
Si_gnF.scaffold10533.76297bp_tag414484 b a a b b a b a a a b a b b a b a a b - a a b b <-- this one flipped
Si_gnF.scaffold10533.98416bp_tag414526 b a a b b a b a a a b a b b a b a a b b a a b b <-- this one flipped
Si_gnF.scaffold10533.53688bp_tag414456 b a a b b a b a a a b a b b a b a a b b a a b b
作为第一步,我需要进行成对比较。但是,量化差异的好方法是什么,以便我知道必须翻转标签的哪些行? (2个连续的行很少匹配100%;可能存在多个(甚至很多)不匹配以及缺失值)。
(理想情况下是红宝石或R)
答案 0 :(得分:2)
您可以使用Levenshtein算法量化两个字符串之间的差异。一种方法:
require 'text' # See http://rubygems.org/gems/text
lines # => a array with each line
def compare(line1, line2)
Text::Levenshtein.distance(line1.sub(/.*\s/, '').sort,
line2.sub(/.*\s/, '').sort)
end
compare(lines[0], lines[1]) # => 1 (one value different)
(如果“a b a a”不等于“a a a b”,请从方法中删除sort
。)