Question

EDITED

I have about 50 .mer files of roughly 9 GB each, which look like this:

"xxxxx";"123\t123\t123\v234\t234\v234\t224\t234\v"
"yyyyy";"123\t234\t224\v234\t234\v234\t224\t234\v"
"zzzzz";"123\t456\t565\v234\t774"

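Each line is a UUID followed by ";", then possibly an extra tab-separated entry, then a vertical-tab-delimited list of further tab-separated entries, all enclosed in quotes. I have shown the entries here as 3-digit numbers, but in reality they are variable-length strings, which can contain doubled quotes "".

I need to turn them into this: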

xxxxx\t123\t123\t123
xxxxx\t234\t234
xxxxx\t234\t224\t234
yyyyy\t123\t234\t224
yyyyy\t234\t234
yyyyy\t234\t224\t234
zzzzz\t123\t456\t565
zzzzz\t234\t774

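That is, split each line on the vertical tabs and prefix every resulting line with the first field of the line it came from.

Currently I am using a crude regex, which does at least work, but it requires multiple runs and manual checking.

How can I do this with awk or sed? I have tried adapting the current answers below, but I am having trouble finding out what the P and D suffixes mean.

(Note: I am using Git Bash on Windows, so I assume that means GNU sed and awk?)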

Answer 1

This might work for you (GNU sed):

sed -r 's/^((\S*\t)\S*)\v/\1\n\2/;P;D' file

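Replace each \v with a newline, the first field and a tab. Then print and delete the first line (P prints the pattern space up to the first newline; D deletes up to that newline and restarts the script on whatever remains) and repeat.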
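Since the question asks what P and D do, the cycle described above can be mimicked in Java for a single input line. This is only a rough sketch, not how sed is implemented: the class and method names are made up for illustration, and the input is assumed to be in the original tab-separated form (UUID, tab, then \v-separated values).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class SedCycleSketch {
  // Java counterpart of the sed pattern ^((\S*\t)\S*)\v ; \x0B is the vertical tab
  private static final Pattern FIRST_VT = Pattern.compile("^((\\S*\\t)\\S*)\\x0B");

  // Returns the lines that the sed script would print for one input line.
  static List<String> expand(String line) {
    List<String> printed = new ArrayList<>();
    String patternSpace = line;
    while (true) {
      // s/^((\S*\t)\S*)\v/\1\n\2/ : the first \v becomes a newline plus "uuid<TAB>"
      patternSpace = FIRST_VT.matcher(patternSpace).replaceFirst("$1\n$2");
      int newline = patternSpace.indexOf('\n');
      // P: print the pattern space up to the first newline
      printed.add(newline < 0 ? patternSpace : patternSpace.substring(0, newline));
      // D: without a newline the cycle ends and sed would read the next input line
      if (newline < 0) break;
      // D: otherwise drop everything up to the newline and rerun the script
      patternSpace = patternSpace.substring(newline + 1);
    }
    return printed;
  }

  public static void main(String[] args) {
    expand("uuid\t1234\u000b1412\u000b1231").forEach(System.out::println);
  }
}
```

Each pass through the loop corresponds to one run of the sed script on the current pattern space, which is why a single input line can produce several output lines.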

EDIT: for the revised question:

sed -r '/\n/!s/"(")?/\1/g;/\n/!s/;/\t/;s/^((\S*\t)[^\v]*)\v/\1\n\2/;/\t$/!P;D' file

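Remove any lone double quotes (replacing doubled double quotes with a single one) and replace the semicolon with a tab. Then replace any \v with a newline, the first field and a tab, and repeat.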

Answer 2

awk -F';' -v OFS='\t' '
{
    # with ";" as field separator there are two fields;
    # strip the leading and trailing double quote from each
    for (i = 1; i <= NF; i++)
        gsub(/^"|"$/, "", $i)
    # split the second field on vertical tabs, storing the pieces in array a
    c = split($2, a, "\v")
    # print the uuid (field 1) and each non-empty piece, separated by a tab
    for (i = 1; i <= c; i++)
        if (a[i] != "") print $1, a[i]
}' file

The explanation is inline.

Output:

xxxxx   123     123     123
xxxxx   234     234
xxxxx   234     224     234
yyyyy   123     234     224
yyyyy   234     234
yyyyy   234     224     234
zzzzz   123     456     565
zzzzz   234     774

Answer 3

You can use this awk command to get the output shown below it:

awk 'BEGIN{FS=OFS="\t"} n = split($2, a, "\x0b") {
        for (i=1; i<=n; i++) print $1, a[i]}' file

195a664e-e0d0-4488-99d6-5504f9178115    1234
195a664e-e0d0-4488-99d6-5504f9178115    1412
195a664e-e0d0-4488-99d6-5504f9178115    1231
195a664e-e0d0-4488-99d6-5504f9178115    4324
195a664e-e0d0-4488-99d6-5504f9178115    1421
195a664e-e0d0-4488-99d6-5504f9178115    3214
a1d61289-7864-40e6-83a7-8bdb708c459e    1412
a1d61289-7864-40e6-83a7-8bdb708c459e    6645
a1d61289-7864-40e6-83a7-8bdb708c459e    5334
a1d61289-7864-40e6-83a7-8bdb708c459e    3453
a1d61289-7864-40e6-83a7-8bdb708c459e    5453

How it works:

BEGIN{FS=OFS="\t"}       # sets input and output field separator as tab
n = split($2, a, "\x0b") # splits second field using Hex 0B (ASCII 11) i.e. vertical tab
for (i=1; i<=n; i++) ... # prints pair of field 1 with each item from split array a

Answer 4

GNU sed:

 sed  's/"\|..$//g;s/;/\t/;:r;s/^\([^\t]*\)\t\(.*\)\\v/\1\t\2\n\1\t/;t r;s/\\t/\t/g;'  YourFile

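This recursively replaces each \v with a newline followed by the first "field" and a tab, and cleans up the extra characters in the record.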

Answer 5

Another solution using awk:

awk '
    BEGIN{FS="[\v;]"}
   {
       gsub("[\"]",""); 
       for(i=2; i<=NF; ++i) 
           if($i) printf "%s\t%s\n", $1, $i;
   }' file.mer

Another solution using sed:

sed -r 's/\v\n/\v/g; s/"//g;
    :a; s/([^;]*);([^\v]*)\v/\1;\2\n\1;/g; ta; 
    s/;/\t/g;' file.mer | sed -r '/^[^\t]+\t$/d'

You get:

xxxxx   123 123 123
xxxxx   234 234
xxxxx   234 224 234
yyyyy   123 234 224
yyyyy   234 234
yyyyy   234 224 234
zzzzz   123 456 565
zzzzz   234 774

Answer 6

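OK, I waited until Kent's answer had been accepted and awarded the bounty, because the question was about awk/sed. My answer may therefore be somewhat off-topic, but anyway, here is my Java solution, which I wrote just for fun as a kata.

MER input file generator:

I thought it would be nice to generate some sample input files with random values. Each line consists of

- a UUID,
- 0-9 groups, separated by vertical tabs,
- within each group, 1-4 strings, separated by horizontal tabs,
- each string consisting of 1-20 characters, in which double quotes are escaped by other double quotes, i.e. "".

I think this is enough to get some good test data.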
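To make the line format concrete before the random generator, here is a small deterministic sketch that builds one line from known values. The class name MerLineSketch and the method buildLine are hypothetical helpers for illustration only; they are not part of the generator that follows.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class MerLineSketch {
  // Builds one line of the form "uuid";"group<VT>group...":
  // entries inside a group are joined by tabs, groups by vertical tabs.
  static String buildLine(String uuid, List<List<String>> groups) {
    String payload = groups.stream()
      .map(group -> group.stream()
        .map(entry -> entry.replace("\"", "\"\""))   // escape " as ""
        .collect(Collectors.joining("\t")))
      .collect(Collectors.joining("\u000b"));
    return "\"" + uuid + "\";\"" + payload + "\"";
  }

  public static void main(String[] args) {
    // Two groups: three entries, then two entries
    System.out.println(buildLine("xxxxx", Arrays.asList(
      Arrays.asList("123", "123", "123"),
      Arrays.asList("234", "234"))));
  }
}
```

A line built this way from fixed values is handy for checking a splitter by hand before running it against random data.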

package de.scrum_master.stackoverflow;

import org.apache.commons.lang.RandomStringUtils;

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Random;
import java.util.UUID;

public class RandomFileGenerator {
  private static final int BUFFER_SIZE = 1024 * 1024;
  private final static Random RANDOM = new Random();
  private final static char VERTICAL_TAB = '\u000b';
  private final static char[] LEGAL_CHARS =
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzäöüÄÖÜß. -\""
      .toCharArray();

  public static void main(String[] args) throws IOException {
    long startTime = System.currentTimeMillis();
//    final long maxOutputSize = 9L * 1024 * 1024 * 1024;
//    final String outputFile = "src/main/resources/sample-9gb.mer";
    final long maxOutputSize = 1L * 1024 * 1024;
    final String outputFile = "src/main/resources/sample-1mb.mer";
    long totalOutputSize = 0;
    long lineCount = 0;
    String line;
    try (PrintWriter writer = new PrintWriter(new BufferedWriter(new FileWriter(outputFile), BUFFER_SIZE))) {
      while (totalOutputSize < maxOutputSize) {
        line = generateLine();
        writer.println(line);
        totalOutputSize += line.length() + 1;
        lineCount++;
      }
    }
    System.out.println(lineCount);
    System.out.println(totalOutputSize);
    System.out.println((System.currentTimeMillis() - startTime) / 1000.0);
  }

  private static String generateLine() {
    StringBuilder buffer = new StringBuilder();
    buffer
      .append('"')
      .append(UUID.randomUUID().toString())
      .append("\";\"");
    int numItems = RANDOM.nextInt(10);
    for (int i = 0; i < numItems; i++) {
      int numSubItems = 1 + RANDOM.nextInt(4);
      for (int j = 0; j < numSubItems; j++) {
        buffer.append(
          RandomStringUtils.random(1 + RANDOM.nextInt(20), 0, LEGAL_CHARS.length, false, false, LEGAL_CHARS)
            .replaceAll("\"", "\"\"")
        );
        if (j + 1 < numSubItems)
          buffer.append('\t');
      }
      if (i + 1 < numItems) {
        buffer.append(VERTICAL_TAB);
      }
    }
    buffer.append('"');
    return buffer.toString();
  }

}

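As you can see, it is easy to create test files of the desired size, e.g.

- 1 MB: maxOutputSize = 1L * 1024 * 1024
- 9 GB: maxOutputSize = 9L * 1024 * 1024 * 1024

I mostly used the smaller one to check the algorithms during development and the really big one for performance tuning.

4 different versions of the file splitter:

The variants shown here use different approaches, but they all read the input as a Java stream obtained from reader.lines() on a BufferedReader. Switching from streams to a simple for loop made it slower, BTW. All solutions write their results to a PrintWriter.

1. reader.lines().forEach(), then regex matching + splitting. This solution strikes the best balance between readability, brevity and performance.
2. reader.lines().flatMap(), i.e. a sub-stream for the vertical-tab-separated groups after the UUID, also using regex matching + splitting. This solution is also quite short and elegant, but it is harder to read than #1 and about 15% slower.
3. Because regex-based calls such as replace() and split() are pretty expensive, I developed a solution that iterates over the string and uses indexOf() and substring() instead of regex. This is substantially faster than #1 and #2, but the code is harder to read, in a way I am starting to dislike. It should only be done if performance is really important, i.e. if the file splitter is used regularly. For a one-off solution, or if it runs once a month, I think it is not worth it from a maintainability standpoint.
4. A further optimised version of #3, which avoids some more overhead and is again a little faster, but not by much. Now the code really needs source code comments to convey to the reader what the algorithm does. From a clean code perspective this is a nightmare. (Do not do this at home, kids!)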

package de.scrum_master.stackoverflow;

import java.io.*;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FileSplitter {
  private static final int BUFFER_SIZE = 1024 * 1024;
  private static final Pattern LINE_PATTERN = Pattern.compile("^\"([^\"]+)\";\"(.*)\"$");
  private final static char VERTICAL_TAB = '\u000b';

  public static void main(String[] args) throws IOException {
    long startTime = System.currentTimeMillis();
    String inputFile = "src/main/resources/sample-9gb.mer";
    String outputFile = inputFile.replaceFirst("mer$", "txt");
    try (
      BufferedReader reader = new BufferedReader(new FileReader(inputFile), BUFFER_SIZE);
      PrintWriter writer = new PrintWriter(new BufferedWriter(new FileWriter(outputFile), BUFFER_SIZE))
    ) {
//      forEachVariant(reader, writer);
//      flatMapVariant(reader, writer);
      noRegexSimpleVariant(reader, writer);
//      noRegexOptimisedVariant(reader, writer);
    }
    System.out.println((System.currentTimeMillis() - startTime) / 1000.0);
  }

  private static void forEachVariant(BufferedReader reader, PrintWriter writer) {
    Matcher matcher = LINE_PATTERN.matcher("dummy");
    reader.lines()
      .forEach(line -> {
        matcher.reset(line).matches();
        for (String record : matcher.group(2).replace("\"\"", "\"").split("\\v"))
          writer.println(matcher.group(1) + "\t" + record);
      });
  }

  private static void flatMapVariant(BufferedReader reader, PrintWriter writer) {
    Matcher matcher = LINE_PATTERN.matcher("dummy");
    reader.lines()
      .flatMap(line -> {
        matcher.reset(line).matches();
        return Arrays
          .stream(matcher.group(2).replace("\"\"", "\"").split("\\v"))
          .map(record -> matcher.group(1) + "\t" + record);
      })
      .forEach(writer::println);
  }

  private static void noRegexSimpleVariant(BufferedReader reader, PrintWriter writer) {
    reader.lines()
      .forEach(line -> {
        final int lineLength = line.length();

        // UUID + '\t'
        int indexLeft = 1;
        int indexRight = line.indexOf('"', indexLeft);
        final String uuid = line.substring(indexLeft, indexRight) + "\t";

        indexLeft = indexRight + 3;
        String record;
        int quoteIndex;
        while (indexLeft < lineLength) {
          writer.print(uuid);
          indexRight = line.indexOf(VERTICAL_TAB, indexLeft);
          if (indexRight == -1)
            indexRight = lineLength - 1;

          while (indexLeft < indexRight) {
            quoteIndex = line.indexOf('"', indexLeft);
            if (quoteIndex == -1 || quoteIndex >= indexRight)
              quoteIndex = indexRight;
            else
              quoteIndex++;
            record = line.substring(indexLeft, quoteIndex);
            writer.print(record);
            indexLeft = quoteIndex + 1;
          }
          writer.println();
          indexLeft = indexRight + 1;
        }
      });
  }

  private static void noRegexOptimisedVariant(BufferedReader reader, PrintWriter writer) throws IOException {
    reader.lines()
      .forEach(line -> {
        // UUID + '\t'
        int indexLeft = 1;
        int indexRight = line.indexOf('"', indexLeft);
        final String uuid = line.substring(indexLeft, indexRight) + "\t";

        // Skip '";"' after UUID
        indexLeft = indexRight + 3;

        final int lineLength = line.length();
        String recordChunk;
        int quoteIndex;

        // If search for '"' has once reached end of line, search no more
        boolean doQuoteSearch = true;

        // Iterate over records per UUID, separated by vertical tab
        while (indexLeft < lineLength) {
          writer.print(uuid);
          indexRight = line.indexOf(VERTICAL_TAB, indexLeft);
          if (indexRight == -1)
            indexRight = lineLength - 1;

          // Search for '""' within record incrementally, + replace each of them by '"'.
          // BTW, if '"' is found, it actually always will be an escaped '""'.
          while (indexLeft < indexRight) {
            if (doQuoteSearch) {
              // Only search for quotes if we never reached the end of line before
              quoteIndex = line.indexOf('"', indexLeft);
              assert quoteIndex != -1;
              if (quoteIndex >= lineLength - 1)
                doQuoteSearch = false;
              if (quoteIndex >= indexRight)
                quoteIndex = indexRight;
              else
                quoteIndex++;
            }
            else {
              // No more '"' within record
              quoteIndex = indexRight;
            }

            // Write record chunk, skipping 2nd '"'
            recordChunk = line.substring(indexLeft, quoteIndex);
            writer.print(recordChunk);
            indexLeft = quoteIndex + 1;
          }

          // Do not forget newline before reading next line/UUID
          writer.println();
          indexLeft = indexRight + 1;
        }
      });
  }
}

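Updated awk script:

Besides: each Java solution writes out one UUID without any content in case there was none in the input file. This would be easy to avoid, but I did it on purpose. That is the only difference from this slightly updated awk script (based on Dave's, but also replacing "" with "), which I used as a benchmark: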

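#!/usr/bin/awk -f

# Fields are separated by ";" on input and by a tab on output
BEGIN { FS = ";"; OFS = "\t" }

{
  for(i=1;i<=NF;i++) {
    gsub(/^"|"$/,"",$i)   # strip the enclosing quotes
    gsub(/""/,"\"",$i)    # unescape doubled quotes
  }
  c=split($2,a,"\\v")     # split the records on vertical tabs
  for(i=1;i<=c;i++)
    print $1,a[i]
}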
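The answer says the UUID-only output lines would be easy to avoid. A minimal sketch of one way to do so is shown here, reusing the line pattern from the splitter above. The class name SkipEmptySketch and the method splitLine are made up for this example, and it returns a list instead of writing to a PrintWriter so that the result is easy to inspect.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SkipEmptySketch {
  // Same pattern as LINE_PATTERN in the splitter: "uuid";"payload"
  private static final Pattern LINE_PATTERN = Pattern.compile("^\"([^\"]+)\";\"(.*)\"$");

  // Returns "uuid<TAB>record" for every non-empty record of one input line.
  static List<String> splitLine(String line) {
    List<String> out = new ArrayList<>();
    Matcher matcher = LINE_PATTERN.matcher(line);
    if (!matcher.matches())
      return out;                       // assumption: silently skip malformed lines
    String uuid = matcher.group(1);
    // Unescape "" to ", then split on the vertical tab (\x0B)
    for (String record : matcher.group(2).replace("\"\"", "\"").split("\\x0B"))
      if (!record.isEmpty())            // this check suppresses UUID-only lines
        out.add(uuid + "\t" + record);
    return out;
  }

  public static void main(String[] args) {
    splitLine("\"zzzzz\";\"123\t456\t565\u000b234\t774\"").forEach(System.out::println);
  }
}
```

With the isEmpty() check in place, a line whose quoted list is empty produces no output at all, which matches what the awk script does.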

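Performance results:

I measured both parsing and writing performance.

- Parsing means reading a 9 GB file from disk and splitting it, but writing the output to /dev/null or not at all.
- Writing means reading the same 9 GB file and writing the result back to the same disk partition (hybrid HD + SSD type), i.e. this could be optimised further by writing to another physical disk. The output file is 18 GB in size.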
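One way to take a "parsing only" measurement without depending on /dev/null is to hand the splitter a PrintWriter that discards its output. The sketch below is an assumption about how such a measurement could be set up, not the harness used for the numbers in this answer. DiscardingWriter and timeParseOnly are hypothetical names.

```java
import java.io.PrintWriter;
import java.io.Writer;
import java.util.function.Consumer;

class ParseOnlySketch {
  // Writer that throws its input away but counts the characters it received.
  static final class DiscardingWriter extends Writer {
    long charCount = 0;
    @Override public void write(char[] buffer, int offset, int length) { charCount += length; }
    @Override public void flush() { }
    @Override public void close() { }
  }

  // Runs the task against a discarding PrintWriter and returns the elapsed seconds.
  static double timeParseOnly(Consumer<PrintWriter> task) {
    long start = System.nanoTime();
    try (PrintWriter writer = new PrintWriter(new DiscardingWriter())) {
      task.accept(writer);
    }
    return (System.nanoTime() - start) / 1e9;
  }

  public static void main(String[] args) {
    // Stand-in workload; a real run would call one of the splitter variants here
    double seconds = timeParseOnly(writer -> {
      for (int i = 0; i < 1_000_000; i++)
        writer.print("uuid\t123\t456");
    });
    System.out.println(seconds);
  }
}
```

The character counter is only there so that a run can confirm the splitter really produced output even though nothing reached the disk.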
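1. Reading the file and splitting it into lines, without parsing the lines: 66 s
2. awk
  - parsing only: 533 s
  - parsing + writing: 683 s
3. reader.lines().forEach(), then regex matching + splitting
  - parsing only: 212 s
  - parsing + writing: 425 s
4. reader.lines().flatMap(), i.e. using sub-streams
  - parsing only: 245 s
  - parsing + writing: not measured
5. Without regex, but using String.replace("\"\"", "\"") (not shown in the code here)
  - parsing only: 154 s
  - parsing + writing: 369 s
6. No regex, no replace(), simple version
  - parsing only: 86 s
  - parsing + writing: 342 s
7. No regex, no replace(), optimised version
  - parsing only: 84 s
  - parsing + writing: 342 s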
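The variant in item 5 is not shown in the answer. The following is a guess at what such a variant might look like: indexOf() and substring() locate the UUID and the records, while String.replace() does the unescaping. It is a sketch under that assumption, not the code that was actually benchmarked, and the names ReplaceVariantSketch and splitLine are invented here.

```java
import java.util.ArrayList;
import java.util.List;

class ReplaceVariantSketch {
  private static final char VERTICAL_TAB = '\u000b';

  // Returns "uuid<TAB>record" for each record of one input line.
  static List<String> splitLine(String line) {
    List<String> out = new ArrayList<>();

    // The UUID sits between the first two double quotes
    int uuidEnd = line.indexOf('"', 1);
    String uuid = line.substring(1, uuidEnd) + "\t";

    // The payload starts after '";"' and ends before the closing quote
    String payload = line.substring(uuidEnd + 3, line.length() - 1)
      .replace("\"\"", "\"");           // unescape "" to " in one pass

    // Walk over the payload, cutting it at each vertical tab
    int left = 0;
    while (true) {
      int right = payload.indexOf(VERTICAL_TAB, left);
      if (right == -1) {
        out.add(uuid + payload.substring(left));
        break;
      }
      out.add(uuid + payload.substring(left, right));
      left = right + 1;
    }
    return out;
  }

  public static void main(String[] args) {
    splitLine("\"yyyyy\";\"123\t234\t224\u000b234\t234\"").forEach(System.out::println);
  }
}
```

Like the Java variants in the answer, this sketch emits a UUID-only entry when the quoted list is empty.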

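Sorry for the lengthy treatise, but I wanted to share my findings with others who read the question and the other answers and wonder whether Java (or C?) might be faster than awk. Yes, it is, by quite a bit, but not by an order of magnitude, because disk performance is also a factor. I think this is a warning to those who tend to over-optimise for optimisation's sake. It is not worth it if you go too far; just try to hit the sweet spot between effort, readability and performance. Amen.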

Splitting lines of text while prepending a prefix
