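Most efficient way to replace a line in a text document?

Time: 2016-02-06 16:02:33

Tags: text-processing file string c

I am learning to write Unix code in C. So far I have written code that finds the index of the first byte of the line I want to replace. The problem is that the replacement sometimes has more bytes than the line already in the file, and in that case the code starts overwriting the next line. I came up with two standard solutions:

a) Instead of trying to edit the file in place, copy the whole file into memory, edit it there (shifting all the bytes if necessary), and then write it back to the file.

b) Copy only the line I want into memory and edit it.

Neither suggestion scales well, and I don't want to impose any restriction on line size (such as requiring every line to be 50 bytes). Is there an efficient way to replace a line? Any help would be appreciated.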

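3 answers:

Answer 0: (score: 2)

With text files you always have to "spool" them, because the text to be deleted, inserted, or replaced is almost always larger or smaller than what is already there.

"Spooling" means opening a temporary file in the file's directory, reading the original and writing it to the temporary file, stopping where the replacement, insertion, or deletion starts, performing the operation, and then copying the remainder to the output. If all goes well, unlink the original file and rename the new file to the old name.

PS: if you don't want any limit on line size, you have to process character by character with fgetc/fputc (no sweat; C can be as fast as your disk allows).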

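Answer 1: (score: 1)

I actually ran into this problem last month with a log file that had grown to 30 GB and consisted of a single line. Tools like sed and perl wanted to consume all available memory to do anything with it. Technically, neither of your solutions scales well. In practice they are fine, and (b) is preferred. You should use fgets with a buffer of about 8 kB and iterate until the last character is a newline or you have reached EOF. In my solution I used perl's sysread function, reading 16 kB chunks at a time.

From memory: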

#define BUF_SZ 16383
char *buf = alloca(BUF_SZ + 1);
infile = fopen(...);
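/* fgets returns NULL at EOF or on error, so no separate feof test is needed */
while (fgets(buf, BUF_SZ + 1, infile) != NULL) {
   size_t len = strlen(buf);
   /* the line continues if the buffer did not end in a newline */
   readmore = (len > 0 && buf[len - 1] != '\n');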
   /* other processing
   .
   .
   */
   if (readmore) {
     /* apply different strategies for dealing with buf */
   }
}

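I think the strategy really depends on what you are trying to do. If you want to delete or truncate the line and only need to match the beginning of the line, it is very simple (no special code). But if you need a long pattern match that may extend past the first 16 kB, you have to do something like moving the last n bytes (where n is the maximum size of the search pattern) to the beginning of buf and doing the next read into &buf[n].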
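The carry-over idea can be sketched as follows. This is only an illustration under my own assumptions: the helper name `count_matches` is made up, it counts occurrences instead of replacing them, and it carries one byte less than the pattern length, which is enough for a fixed pattern.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: count occurrences of `pat` in stream `in`, reading
 * `chunk` bytes at a time. Returns -1 on bad arguments or allocation failure. */
long count_matches(FILE *in, const char *pat, size_t chunk)
{
    size_t plen = pat ? strlen(pat) : 0;
    if (!in || plen == 0 || chunk == 0) return -1;

    size_t keep = plen - 1;              /* max bytes carried between reads */
    char *buf = malloc(keep + chunk);
    if (!buf) return -1;

    long count = 0;
    size_t have = 0, n;                  /* have = carried bytes at buf[0..] */
    while ((n = fread(buf + have, 1, chunk, in)) > 0) {
        size_t total = have + n;
        for (size_t i = 0; i + plen <= total; i++)
            if (memcmp(buf + i, pat, plen) == 0)
                count++;
        /* move the tail to the front so a match spanning the chunk
           boundary is seen together with the next read */
        have = total < keep ? total : keep;
        memmove(buf, buf + total - have, have);
    }
    free(buf);
    return count;
}
```

Because fewer bytes than the pattern length are carried, a match can never sit entirely inside the carried tail, so nothing is counted twice.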

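You write the output to a new file handle, and once everything has completed correctly you unlink the first file and rename the new file to the old name. Also look into mktemp (or mkstemp, which also opens the file) for creating the temporary file in the same directory, and an atexit call for cleaning up in case of errors.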

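Answer 2: (score: 0)

The most efficient way to replace a line of text in a file depends on a number of things. [1] The primary concern for an efficient search and replace is minimizing the number of file reads and writes, because file I/O is generally at least an order of magnitude slower than in-memory operations. The simple case is when the search and replacement strings have exactly the same number of characters. In that case, and only in that case, you can do the replacement within the file itself without writing a second (temporary) file.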
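For that equal-length case, a minimal sketch might look like the following. The helper name `overwrite_in_place` and its parameters are hypothetical, and it assumes the byte offset of the text to replace is already known.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: overwrite `len` bytes at byte offset `pos` of `path`
 * with `text`, but only when `text` is exactly `len` bytes long.
 * Returns 0 on success, -1 otherwise. */
int overwrite_in_place(const char *path, long pos, size_t len, const char *text)
{
    if (!path || !text || pos < 0 || strlen(text) != len) return -1;

    FILE *fp = fopen(path, "r+b");       /* update mode: no truncation */
    if (!fp) return -1;

    int ok = fseek(fp, 0, SEEK_END) == 0;
    long size = ok ? ftell(fp) : -1;
    /* refuse to write past the current end of the file */
    ok = ok && size >= 0 && (unsigned long)pos + len <= (unsigned long)size;
    ok = ok && fseek(fp, pos, SEEK_SET) == 0;
    ok = ok && fwrite(text, 1, len, fp) == len;
    if (fclose(fp) == EOF) ok = 0;
    return ok ? 0 : -1;
}
```

The length check up front is the whole point: as soon as the replacement is longer or shorter, the bytes after it would have to move and a second file (or an in-memory copy) is needed.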

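With file I/O efficiency in mind, the most efficient (fastest) way to do a search/replace is to mmap the whole file or to use sendfile. Both can take advantage of kernel-space copying of blocks of text, which is usually a significant improvement over user-space copy operations. Neither is difficult. The next best option is to read the entire contents of the file into memory with buffered reads and then search the in-memory buffer to identify the positions (addresses) where content must change. You can then write the buffer out piece by piece to a separate output file, writing the replacement text at each position identified during the search of the original buffer.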
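As a rough illustration of the mmap route, assuming a POSIX system, the sketch below maps the input read-only and writes the edited text to a separate stream. The helper name `mmap_replace` and its return convention are my own, and it does not attempt the sendfile variant.

```c
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: copy `path` to `out`, replacing each occurrence of
 * `srch` with `rplc`. Returns the number of replacements, or -1 on error. */
long mmap_replace(const char *path, const char *srch, const char *rplc, FILE *out)
{
    size_t slen = strlen(srch), rlen = strlen(rplc);
    if (slen == 0) return -1;

    int fd = open(path, O_RDONLY);
    if (fd == -1) return -1;

    struct stat st;
    if (fstat(fd, &st) == -1) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; }    /* nothing to map */

    size_t size = (size_t)st.st_size;
    char *map = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                       /* the mapping stays valid after close */
    if (map == MAP_FAILED) return -1;

    long count = 0;
    const char *p = map, *end = map + size, *hit;
    /* only look where a complete search term can still fit */
    while ((size_t)(end - p) >= slen &&
           (hit = memchr(p, *srch, (size_t)(end - p) - slen + 1))) {
        if (memcmp(hit, srch, slen) == 0) {
            fwrite(p, 1, (size_t)(hit - p), out);     /* text before match */
            fwrite(rplc, 1, rlen, out);
            p = hit + slen;
            count++;
        } else {
            fwrite(p, 1, (size_t)(hit - p) + 1, out); /* through false hit */
            p = hit + 1;
        }
    }
    fwrite(p, 1, (size_t)(end - p), out);             /* unmatched tail */
    munmap(map, size);
    return ferror(out) ? -1 : count;
}
```

Here the kernel pages the file in on demand, so the program never manages a read buffer of its own; the output still goes through ordinary stdio buffering.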

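For the problem in your example, there is no need to read the whole file into memory at once (even though files are rarely larger than INT_MAX bytes). For embedded systems and the like, where both memory and efficiency are a challenge, you can simply choose an arbitrary block size that fits your size limits, read one block into memory at a time, and perform the search/replace (handling corner cases as needed, e.g. when fewer characters than the length of the search term remain in the block after the first part of the search string).

The key is to minimize the number of trips back to the drive to fetch more data or to write out from the buffer. So, in general, the larger the block the better.

The following is a short minimal example that reads a given file in blocks of a user-specified size (capped at the file size itself if that is smaller than the requested block size). The code simply reads a block of data from the file and uses memchr to find each occurrence of the first character of the search term. When that character is found, memcmp checks the memory starting at the matching character for the length of the search term. Depending on whether the comparison matches or fails, the various indexes are updated and the search continues.

In this case the data is written to a new file (stdout), either each time a matching term is found within the block or, if no match is found, at the end of the block. The code can be optimized further, and additional sanity checks can always be added. Look over the example and let me know if you have any questions. None of this is difficult, but it does require keeping a few separate indexes (e.g. the current buffer position and the last read/write position) to perform the search/replace. The example below uses a text file containing a couple of paragraphs discussing "personal injury", replaces "injury" with "hygiene", and writes the result to stdout.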

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

#define BUFSZ (1 << 20)  /* default max block size (1M) */

void *find_rplc_file (char *srch, char *rplc, FILE *ifp, FILE *ofp, long blksz);

int main (int argc, char **argv) {

    if (argc != 4) {
        fprintf (stderr, "error: insufficient input.\n"
                         "usage: %s infile <search> <replace>\n", argv[0]);
        return 1;
    }

    FILE *ifp = fopen (argv[1], "rb");

    if (!ifp) {
        fprintf (stderr, "error: file open failed '%s'.\n", argv[1]);
        return 1;
    }

    if (!find_rplc_file (argv[2], argv[3], ifp, stdout, BUFSZ)) {
        fprintf (stderr, "error: find/replace failure.\n");
        return 1;
    }
    putchar ('\n');

    fclose (ifp);

    return 0;
}

void *find_rplc_file (char *srch, char *rplc, FILE *ifp, FILE *ofp, long blksz)
{
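    /* reject missing arguments, an empty search term or a bad block size */
    if (!ifp || !srch || !rplc || !*srch || blksz <= 0) return NULL;
    /* a block must be able to hold the whole search term */
    if ((size_t)blksz < strlen (srch)) return NULL;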

    char *fb, *filebuf = NULL;
    size_t offset = 0, nbytes = 0, readsz = 0, rlen, slen;
    long  bytecnt = 0, readpos = 0, size = 0;

    rlen = strlen (rplc);   /* length of search/replace text */
    slen = strlen (srch);

    fseek (ifp, 0, SEEK_END);
    if ((size = ftell (ifp)) == -1) {  /* get file length */
        fprintf (stderr, "error: unable to determine file length.\n");
        return NULL;
    }
    fseek (ifp, 0, SEEK_SET);

    /* limit blksz to the lesser of INT_MAX or blksz */
    blksz = blksz > INT_MAX ? INT_MAX : blksz;

    /* validate blksz does not exceed file size */
    readsz = blksz > size ? size : blksz;

    /* allocate memory for filebuf */
    if (!(filebuf = calloc (readsz, sizeof *filebuf))) {
        fprintf (stderr, "error: virtual memory exhausted.\n");
        return NULL;
    }

    /* read entire file readsz bytes at a time */
    while ((nbytes = fread (filebuf, sizeof *filebuf, readsz, ifp))) {

        if (nbytes != readsz) fprintf (stderr, "warning: short read.\n");

        readpos = 0;    /* initialize read position & pointer */
        fb = filebuf;

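        /* for each occurrence of 1st char of search term, searching only
           the bytes that remain between fb and the end of the block */
        while ((fb = memchr (fb, *srch, nbytes - (size_t)(fb - filebuf)))) {
            /* set current offset in buffer */
            offset = fb - filebuf;
            /* if less than length of search term remains */
            if (offset + slen > nbytes) {
                /* in the last block no match can complete, so stop searching;
                   re-reading from here would otherwise loop forever (the
                   offset test covers a block too short to hold the term) */
                if (offset == 0 || bytecnt + (long)nbytes >= size) break;
                nbytes = offset; /* set nbytes to current offset */
                /* reset file pointer to account for nbytes reduction */
                fseek (ifp, bytecnt + nbytes, SEEK_SET);
                goto getnext;    /* read next block from here */
            }
            /* otherwise compare fb to search term */
            if (memcmp (srch, fb, slen) == 0) {
                /* if term found, write prior buffer to output file */
                fwrite (filebuf + readpos, sizeof *filebuf, 
                        offset - readpos, ofp);
                /* write replacement text */
                fwrite (rplc, sizeof *rplc, rlen, ofp);
                /* set next readpos to 1st char following search term */
                readpos = offset + slen;
                /* skip the match (at least one byte) so matches cannot overlap */
                fb += slen ? slen : 1;
            }
            else
                fb++;   /* no match: advance one byte for next memchr search */
        }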

    getnext:
        bytecnt += nbytes;  /* increment bytecnt with bytes searched */

        /* write remaining buffer to output file */
        fwrite (filebuf + readpos, sizeof *filebuf, 
                nbytes - readpos, ofp);

        /* check file complete */
        if (bytecnt == size) break;

        /* set next read size (either blksz or remaining chars < blksz) */
        readsz = size - bytecnt > blksz ? blksz : size - bytecnt;
    }

    /* validate all bytes successfully read */
    if ((long)bytecnt != size) {
        fprintf (stderr, "error: file read failed.\n");
        free (filebuf); /* do not leak the buffer on the error path */
        return NULL;
    }

    free (filebuf); /* free filebuf */

    return srch;   /* return something other than NULL for success */
}

Example input

$ cat dat/damages.txt
Personal injury damage awards are unliquidated
and are not capable of certain measurement; thus, the
jury has broad discretion in assessing the amount of
damages in a personal injury case. Yet, at the same
time, a factual sufficiency review insures that the
evidence supports the jury's award; and, although
difficult, the law requires appellate courts to conduct
factual sufficiency reviews on damage awards in
personal injury cases. Thus, while a jury has latitude in
assessing intangible damages in personal injury cases,
a jury's damage award does not escape the scrutiny of
appellate review.

Because Texas law applies no physical manifestation
rule to restrict wrongful death recoveries, a
trial court in a death case is prudent when it chooses
to submit the issues of mental anguish and loss of
society and companionship. While there is a
presumption of mental anguish for the wrongful death
beneficiary, the Texas Supreme Court has not indicated
that reviewing courts should presume that the mental
anguish is sufficient to support a large award. Testimony
that proves the beneficiary suffered severe mental
anguish or severe grief should be a significant and
sometimes determining factor in a factual sufficiency
analysis of large non-pecuniary damage awards.

Search for "injury", replace with "hygiene"

$ ./bin/fread_blks_min dat/damages.txt "injury" "hygiene"
Personal hygiene damage awards are unliquidated
and are not capable of certain measurement; thus, the
jury has broad discretion in assessing the amount of
damages in a personal hygiene case. Yet, at the same
time, a factual sufficiency review insures that the
evidence supports the jury's award; and, although
difficult, the law requires appellate courts to conduct
factual sufficiency reviews on damage awards in
personal hygiene cases. Thus, while a jury has latitude in
assessing intangible damages in personal hygiene cases,
a jury's damage award does not escape the scrutiny of
appellate review.

Because Texas law applies no physical manifestation
rule to restrict wrongful death recoveries, a
trial court in a death case is prudent when it chooses
to submit the issues of mental anguish and loss of
society and companionship. While there is a
presumption of mental anguish for the wrongful death
beneficiary, the Texas Supreme Court has not indicated
that reviewing courts should presume that the mental
anguish is sufficient to support a large award. Testimony
that proves the beneficiary suffered severe mental
anguish or severe grief should be a significant and
sometimes determining factor in a factual sufficiency
analysis of large non-pecuniary damage awards.

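Footnote [1]: the most efficient approach is to use one of the shell tools designed for this kind of task, such as sed or awk, both of which have years of development behind them and fairly good text-processing optimizations built in.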