Question

我正在编写读取包含DNA碱基的巨大文本文件的代码，我需要能够提取特定部分。该文件如下所示：

TGTTCCAGGCTGTCAGATGCTAACCTGGGG
TCACTGGGGGTGTGCGTGCTGCTCCAGCCT
GTTCCAGGATATCAGATGCTCACCTGGGGG

...

每行是30个字符。

我有一个单独的文件来指示这些部分，这意味着我有一个开始值和一个结束值。因此，对于每个开始和结束值，我需要在文件中提取相应的字符串。例如，如果我有开始 = 10，结束 = 45，则需要存储以第一行（C）的第10个字符开头并以在单独的临时文件中第二行（C）的第15个字符。

我尝试将fread函数（如下所示）用于具有上述字母行的测试文件。参数为开始 = 1，结束 = 90，结果文件如下：

TGTTCCAGGCTGTCAGATGCTAACCTGGGG
TCACTGGGGGTGTGCGTGCTGCTCCAGCCT
GTTCCAGGATATCAGATGCTCACCTGGG™eRV

每次运行都会在末尾给出随机字符。

代码：


FILE* fp;
fp=fopen(filename, "r");
if (fp==NULL) puts("Failed to open file");

int start=1, end=90;
char string[end-start+2]; //characters from start to end = end-start+1

fseek(fp, start-1, SEEK_SET);

fread(exon,1, end-start+1, fp);

FILE* tp;
tp=fopen("exon", "w");
if (tp==NULL) puts("Failed to make tmp file");

fprintf(tp, "%s\n", string);
fclose(tp);

我不明白fread如何处理\ n字符，因此我尝试将其替换为以下内容：

int i=0;
char ch;
while (!feof(fp))
{
            ch=fgetc(fp);

            if (ch != '\n') 
            {
                string[i]=ch;
                i++;
                if (i==end-start) break;
            }

}
string[end-start+1]='\0';

它创建了以下文件： TGTTCCAGGCTGTCAGATGCTAACCTGGGGTCACTGGGGGTGTGCGTGCTGCTCCAGCCTGTTCCAGGATATCAGATGCTCACCTGGGGô

（没有任何换行符，我不介意）。每次运行时，我都会得到一个不同的随机字符，而不是'G'。

我在做什么错？有没有一种方法可以通过fread或其他功能来完成？

谢谢。

Answer 1

我已经修改了您的代码，并在其中添加了注释以供解释。

请仔细检查。您已经忽略了错误检查，代码中几乎没有未定义的变量。

如果失败，我已经从if区块返回，那么goto`会更合适。

有关向start和end添加1个字符还是2个字符的信息，请参考this comment。

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main()
{
        FILE* fp;
        // fp = fopen(filename, "r");
        // since the filename is undeclared i have used hard coded file name
        fp = fopen("dna.txt", "r");
        // Nothing wrong in performing error checking
        if (fp == NULL) {
                puts("Failed to open file");
                return -1; 
        }

        // Make sure start is not 0 if you want to use indices starting from 1
        int start = 1, end = 90; 

        // I would adjust the start and end index by adding count of '\n' or '\r\n' to the start and end
        // Here I am adjusting for '\n' i.e 1 char
        // since you have 30 chars so hardcoding it.
        int m = 1; // m depends on whether it is \n or \r\n
                   // 1 for \n and 2 for \r\n
        --start; --end; // adjusting indexes to be 0 based
        if (start != 0)
                start = start + (start / 30) * m;   // start will be 0
        if (end != 0)
                end = end + (end / 30) * m;         // start will be 93

        // lets declare the chars to read
        int char_to_read = end - start + 1;

        // need only 1 extra char to append null char
        // If start and end is going to change, then i would suggest using malloc instead of static buffer
        // because compiler cannot predict the memory to allocate to the buffer if it is dependent on external factor
        // char string[char_to_read + 1]; //characters from start to end = end-start+1

        char *string = malloc(char_to_read + 1); 
        if (string == NULL) {
                printf("malloc failed\n");
                fclose(fp);
                return -2;
        }

        // zero the buffer
        memset(string, 0, char_to_read + 1); 

        int rc = fseek(fp, start, SEEK_SET);
        if (rc == -1) {
                printf("fseek failed");
                fclose(fp);
                return -1;
        }

        // exon is not defined, and btw we wanted to read in string.
        int bytes_read = fread(string, 1, char_to_read, fp);

        // Lets check if there is any error after reading
        if (bytes_read == -1) {
                fclose(fp);
                return -1; 
        }

        // Now append the null char to the end
        string[bytes_read] = 0;
        printf("%s\n", string);
        fclose(fp);

        // free the memory once you are done with it
        if (string)
                free(string);


// Now u can write it back to file.
//      FILE* tp;
//      tp=fopen("exon", "w");
//      if (tp==NULL) puts("Failed to make tmp file");

//      fprintf(tp, "%s\n", string);
//      fclose(tp);
}

在C语言中以字符串形式读取文本文件的特定部分？

1 个答案: