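I am writing code that reads a huge text file containing DNA bases, and I need to be able to extract specific sections from it. The file looks like this: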
TGTTCCAGGCTGTCAGATGCTAACCTGGGG
TCACTGGGGGTGTGCGTGCTGCTCCAGCCT
GTTCCAGGATATCAGATGCTCACCTGGGGG
...
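Each line is 30 characters long.
I have a separate file that specifies the sections, meaning I have a start value and an end value. For each start and end pair, I need to extract the corresponding string from the file. For example, if start = 10 and end = 45, I need to store, in a separate temporary file, the string that begins at the 10th character of the first line (C) and ends at the 15th character of the second line (C).
I tried using the fread function (shown below) on a test file containing the lines of letters above. With start = 1 and end = 90, the resulting file was: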
TGTTCCAGGCTGTCAGATGCTAACCTGGGG
TCACTGGGGGTGTGCGTGCTGCTCCAGCCT
GTTCCAGGATATCAGATGCTCACCTGGG™eRV
Every run produces different random characters at the end.
Code:
FILE* fp;
fp=fopen(filename, "r");
if (fp==NULL) puts("Failed to open file");
int start=1, end=90;
char string[end-start+2]; //characters from start to end = end-start+1
fseek(fp, start-1, SEEK_SET);
fread(exon,1, end-start+1, fp);
FILE* tp;
tp=fopen("exon", "w");
if (tp==NULL) puts("Failed to make tmp file");
fprintf(tp, "%s\n", string);
fclose(tp);
I don't understand how fread handles the \n character, so I tried replacing the fread call with the following:
int i=0;
char ch;
while (!feof(fp))
{
ch=fgetc(fp);
if (ch != '\n')
{
string[i]=ch;
i++;
if (i==end-start) break;
}
}
string[end-start+1]='\0';
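It created the following file: TGTTCCAGGCTGTCAGATGCTAACCTGGGGTCACTGGGGGTGTGCGTGCTGCTCCAGCCTGTTCCAGGATATCAGATGCTCACCTGGGGô
(without any line breaks, which I don't mind). On every run I get a different random character in place of the final 'G'.
What am I doing wrong? Is there a way to do this with fread or some other function?
Thanks.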
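Answer 0 (score: 1)
I have modified your code and added comments to explain the changes.
Please go through it carefully. You skipped error checking, and the code had a few undefined variables.
On failure I simply return from inside the if block, although a goto to a common cleanup label would be more appropriate.
For whether to add 1 or 2 characters per line to start and end, refer to this comment.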
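To make the index adjustment concrete, here is a minimal sketch of the mapping from a 1-based base position to a 0-based byte offset. The helper name base_to_offset and its parameters are my own, not part of the original code, and it assumes every line holds exactly line_len bases followed by eol_len line-ending bytes.

```c
#include <assert.h>

/* Hypothetical helper (not from the original code): map a 1-based base
 * position to a 0-based byte offset in the file.
 * Assumes every line has exactly line_len bases followed by eol_len
 * line-ending bytes (1 for "\n", 2 for "\r\n"). */
long base_to_offset(long pos, long line_len, long eol_len)
{
    assert(pos >= 1 && line_len > 0 && eol_len >= 0);
    long idx = pos - 1;                      /* 0-based base index */
    return idx + (idx / line_len) * eol_len; /* one line ending per full line before idx */
}
```

With 30 bases per line and "\n" endings, start = 1 maps to offset 0 and end = 90 maps to offset 91, so 92 bytes have to be read and two of them are newlines. The full program below applies the same adjustment inline. Note that seeking to computed byte offsets is only reliable when the stream is not translating line endings, so opening the file with "rb" is the safer choice on platforms where text mode differs from binary mode.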
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
int main()
{
FILE* fp;
// fp = fopen(filename, "r");
// since the filename is undeclared i have used hard coded file name
fp = fopen("dna.txt", "r");
// Nothing wrong in performing error checking
if (fp == NULL) {
puts("Failed to open file");
return -1;
}
// Make sure start is not 0 if you want to use indices starting from 1
int start = 1, end = 90;
// I would adjust the start and end index by adding count of '\n' or '\r\n' to the start and end
// Here I am adjusting for '\n' i.e 1 char
// since you have 30 chars so hardcoding it.
int m = 1; // m depends on whether it is \n or \r\n
// 1 for \n and 2 for \r\n
--start; --end; // adjusting indexes to be 0 based
if (start != 0)
start = start + (start / 30) * m; // start will be 0
if (end != 0)
end = end + (end / 30) * m; // end will be 91 (89 + 2 newlines)
// lets declare the chars to read
int char_to_read = end - start + 1;
// need only 1 extra char to append null char
// If start and end is going to change, then i would suggest using malloc instead of static buffer
// because compiler cannot predict the memory to allocate to the buffer if it is dependent on external factor
// char string[char_to_read + 1]; //characters from start to end = end-start+1
char *string = malloc(char_to_read + 1);
if (string == NULL) {
printf("malloc failed\n");
fclose(fp);
return -2;
}
// zero the buffer
memset(string, 0, char_to_read + 1);
int rc = fseek(fp, start, SEEK_SET);
if (rc != 0) { // fseek returns non-zero on failure
printf("fseek failed\n");
free(string);
fclose(fp);
return -1;
}
// exon is not defined, and btw we wanted to read in string.
size_t bytes_read = fread(string, 1, char_to_read, fp);
// fread never returns -1: a short count means EOF or an error, so ask ferror
if (ferror(fp)) {
printf("fread failed\n");
free(string);
fclose(fp);
return -1;
}
// Now append the null char to the end
string[bytes_read] = 0;
printf("%s\n", string);
fclose(fp);
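// free the memory once you are done with it; free(NULL) is a no-op, so no check is needed
free(string);
// To write the result to a file, run the block below before the free above,
// otherwise string is used after it has been released.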
// FILE* tp;
// tp=fopen("exon", "w");
// if (tp==NULL) puts("Failed to make tmp file");
// fprintf(tp, "%s\n", string);
// fclose(tp);
}
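If you would rather drop the line breaks, as in your second attempt, a character-by-character loop also works. In your loop the break fires at i == end-start, one base too early, while the terminator is written at index end-start+1, so the slot in between is never filled, which would explain the random last character. Separately, the first fread version never terminates the buffer at all, and its 90 bytes include two newlines. Below is a minimal sketch of a corrected loop; the function name read_bases and its signature are my own invention, and it assumes the stream is already positioned at the first wanted base.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not from the original code): copy up to count bases
 * from the current position of fp into out, skipping '\n' and '\r'.
 * out must have room for count + 1 bytes.
 * Returns the number of bases actually stored. */
size_t read_bases(FILE *fp, char *out, size_t count)
{
    assert(fp != NULL && out != NULL);
    size_t n = 0;
    int ch; /* int rather than char, so EOF can be told apart from data */
    while (n < count && (ch = fgetc(fp)) != EOF) {
        if (ch != '\n' && ch != '\r')
            out[n++] = (char)ch;
    }
    out[n] = '\0'; /* terminate right after the last base stored */
    return n;
}
```

For start = 1 and end = 90 you would seek to the byte offset of the first base, allocate end - start + 2 bytes, and call read_bases(fp, string, end - start + 1). Testing the return value of fgetc also avoids the while (!feof(fp)) pattern, which only becomes true after a read has already failed.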