Question

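In a past posting, I asked about commands in Bash for aligning text columns row by row. It has become clear to me that the desired task (i.e., aligning text columns of different size and content by row) is much more complex than I initially anticipated, and that the suggested answer, while acceptable for the past posting, is insufficient for most empirical data sets. I would therefore like to query the community on the following pseudocode. Specifically, I would like to know whether, and in what way, it can be optimized.

Assume a file contains n columns of strings. Some strings may be missing, others may be duplicated. The longest column may not be the first one listed in the file, but it shall be the reference column. The order of the rows of this reference column must be maintained.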

> cat file  # where n=3; first row contains column headers
CL1 CL2 CL3
foo foo bar
bar baz qux
baz qux
qux foo
    bar

Pseudocode attempt 1 (totally inadequate):

Shuffle columns so that columns ordered by size (i.e., longest column is first in matrix)
Rownames = strings of first column (i.e., of longest column)
For rownames
  For (colname among columns 2:end)
    if (string in current cell == rowname) {keep string in location}
    if (string in current cell != rowname) {
      if (string in current cell == rowname of next row) {add row to bottom of table; move each string of current column one row down}
      if (string in current cell != rowname of next row) {add row to bottom of table; move each string of all other columns one row down}
    }

Ordering the columns by size:

> cat file_columns_ordered_by_size
CL2 CL1 CL3
foo foo bar
baz bar qux
qux baz 
foo qux 
bar
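
For illustration only, the size-ordering step could be sketched in Python as below. It assumes the sample file uses fixed-width fields of 4 characters (which is how the blank cell in the last row of `file` can be attributed to CL1), and that blank cells are simply dropped from a column. The helper names `read_fixed_width` and `order_columns_by_size` are hypothetical and not part of the original post.

```python
from itertools import zip_longest

def read_fixed_width(lines, width=4):
    """Split each line into fields of `width` characters; blank fields become ''."""
    ncols = len(lines[0].split())          # the header row defines the column count
    rows = []
    for line in lines:
        padded = line.rstrip("\n").ljust(ncols * width)
        rows.append([padded[i * width:(i + 1) * width].strip() for i in range(ncols)])
    return rows

def order_columns_by_size(rows):
    """Return {header: non-empty cells}, longest column first (stable for ties)."""
    header, body = rows[0], rows[1:]
    columns = {name: [r[i] for r in body if r[i]] for i, name in enumerate(header)}
    return dict(sorted(columns.items(), key=lambda kv: len(kv[1]), reverse=True))

text = """CL1 CL2 CL3
foo foo bar
bar baz qux
baz qux
qux foo
    bar"""

columns = order_columns_by_size(read_fixed_width(text.splitlines()))

# Print the reordered table, padding short columns with empty cells.
table = zip_longest(*([name] + cells for name, cells in columns.items()), fillvalue="")
for row in table:
    print(" ".join(f"{cell:<3}" for cell in row).rstrip())
```

With whitespace-delimited input instead of fixed-width fields, a blank cell in the middle of a row could not be told apart from a missing trailing cell, so the parsing step would need a different rule.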

Sought output:

> my_code_here file_columns_ordered_by_size
CL2 CL1 CL3
foo foo 
    bar bar
baz baz    
qux qux qux
foo
bar

Answer 1

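EDIT: Ugh, this doesn't produce your desired output. I guess I don't understand the problem. Anyway, maybe it will help.

If you don't mind slurping the whole table into memory, then associative arrays (hashes) will do. (Or you could use trees, maps, dictionaries, etc.) Have one per column, mapping each string found in a cell of that column to the number of times the string occurs in that column. Let's name the hashes after the column headers. After slurping, they'd look like this: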

CL2 = {'foo':2, 'baz':1, 'bar':1, 'qux':1}
CL1 = {'foo':1, 'baz':1, 'bar':1, 'qux':1}
CL3 = {'bar':1, 'qux':1}

# Store the columns in an array
columnCounts = [CL2, CL1, CL3]
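
As a rough sketch, the same per-column hashes could be built in Python with `collections.Counter`. The column contents are hard-coded here from the size-ordered sample table, and the name `column_counts` is just a Python-style spelling of `columnCounts`; both are assumptions made for illustration.

```python
from collections import Counter

# Cells of each column, taken from file_columns_ordered_by_size (hard-coded sample).
columns = {
    "CL2": ["foo", "baz", "qux", "foo", "bar"],
    "CL1": ["foo", "bar", "baz", "qux"],
    "CL3": ["bar", "qux"],
}

# One Counter per column: string -> number of occurrences in that column.
column_counts = [Counter(cells) for cells in columns.values()]

for name, counts in zip(columns, column_counts):
    print(name, dict(counts))
```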

Then write a loop that produces the output, removing from the associative arrays on each iteration:

while (columnCounts still has at least one non-empty hash) {
    key = the hash-key that is present in most (a plurality) of the hashes
    for each hash in columnCounts {
        if the key is in the hash {
            print key
            Decrement hash[key]
            if hash[key] is 0, remove key from the hash
        }
        else {
            print whitespace
        }
    }

    print newline
}
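
A minimal Python sketch of this loop follows, under two assumptions that the pseudocode leaves open: a key is deleted from a hash once its count reaches zero (otherwise the loop would never end), and ties between equally common keys go to the key encountered first. The function name `emit_rows` is hypothetical. As the edit at the top of this answer says, the result does not match the asker's desired output, because the order of the reference column is not preserved.

```python
from collections import Counter

def emit_rows(column_counts):
    """Greedy loop: each row holds the key still present in the most columns."""
    counts = [+Counter(c) for c in column_counts]   # copies, zero counts dropped
    rows = []
    while any(counts):                              # some column is still non-empty
        # For each remaining key, count how many columns still contain it.
        presence = Counter(key for c in counts for key in c)
        key = max(presence, key=presence.get)       # ties: first key encountered
        row = []
        for c in counts:
            if key in c:
                row.append(key)
                c[key] -= 1
                if c[key] <= 0:
                    del c[key]                      # exhausted: remove from the hash
            else:
                row.append("")                      # whitespace placeholder
        rows.append(row)
    return rows

column_counts = [
    Counter(["foo", "baz", "qux", "foo", "bar"]),   # CL2
    Counter(["foo", "bar", "baz", "qux"]),          # CL1
    Counter(["bar", "qux"]),                        # CL3
]

rows = emit_rows(column_counts)
for row in rows:
    print(" ".join(f"{cell:<3}" for cell in row).rstrip())
```

Each pass removes at least one occurrence from at least one hash, so the loop finishes after at most as many rows as there are cells in the table.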
