Question

The file in question is an RNA-seq pileup file. I want to extract the information for one chromosome. This works for smaller files:

awk '/chrM/ { print }' file1.pileup > file1.chrm.pileup

Error:

awk: (FILENAME=file1.pileup FNR=1743118775) fatal: grow_iop_buffer: iop->buf: can't allocate 137438953474 bytes of memory (Cannot allocate memory)

Is there an alternative command, or a sub-command, that gets around this?

Thanks for your help.

Edit:

The data looks like this:

chr1    258755  T       1       .                 F
chr1    258756  C       1       ......            F
chr1    258757  T       1       ...               H
chr1    258758  A       1       ...........       H

The file is 3529769718150 bytes (about 3.5 TB).

I would expect to find the following (basically rows and rows of it, roughly 70-75% of the way down the file):

chrM    6432       C       1       ^~.            B
chrM    7294       A       1       ........       B
chrM    7296       G       1       .....          B

Edit 2:

Output of 'head -n 1 File1 | od -c':

0000000   c   h   r   1  \t   2   5   8   7   4   9  \t   T  \t   1  \t
0000020   ^   ~   .  \t   C  \n
0000026

Output of 'head -c xxx File1 | od -c':

head: xxx: invalid number of bytes
0000000

Output of 'head -c 100 File1 | od -c':

Answer 1

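It sounds as though your awk command may be unable to handle files larger than 2-4 GB, because 32-bit pointers cannot address them.

Try running

split --line-bytes=2GB file1.pileup

This splits the file into pieces of at most 2 GB each without breaking lines, which you should then be able to process as needed.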
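As a rough sketch of how the pieces could then be filtered, the loop below splits a tiny stand-in file and runs an awk filter over each piece, appending the matches to a single output file. The sample rows, the 60-byte piece size, and the `piece_` prefix are illustrative assumptions, not values from the question; `--line-bytes` is a GNU `split` option, and on the real file the size would be something like `2GB`.

```shell
#!/bin/sh
# Sketch only: tiny stand-in data; sizes and file names are illustrative assumptions.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a small tab-separated stand-in for file1.pileup.
printf 'chr1\t258755\tT\t1\t.\tF\n'      >  file1.pileup
printf 'chr1\t258756\tC\t1\t......\tF\n' >> file1.pileup
printf 'chrM\t6432\tC\t1\t^~.\tB\n'      >> file1.pileup
printf 'chrM\t7294\tA\t1\t........\tB\n' >> file1.pileup

# Split into pieces of at most 60 bytes without breaking a line (GNU split).
split --line-bytes=60 file1.pileup piece_

# Filter every piece in name order and collect the matches in one file.
: > file1.chrm.pileup
for piece in piece_*; do
    awk '$1 == "chrM"' "$piece" >> file1.chrm.pileup
done

n_matches=$(wc -l < file1.chrm.pileup)
cat file1.chrm.pileup
# The temporary directory is left in place so the pieces can be inspected.
```

Because the shell expands `piece_*` in sorted order, the collected output keeps the original line order. Note that splitting a multi-terabyte file this way temporarily needs roughly as much free disk space again as the file itself.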

Answer 2

You can use grep -F (fixed-string search) here instead of awk:

grep -wF 'chrM' file1.pileup > file1.chrm.pileup

If you really want to use awk, a faster and shorter command avoids the regular expression:

awk 'index($0, "chrM")' file1.pileup > file1.chrm.pileup
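These commands do not necessarily select the same lines, which may matter because "chrM" could in principle appear outside the first column. The sketch below counts matches on a few made-up rows (the `chrMT_alt` contig name and the row with "chrM" in the last column are contrived for illustration) using `grep -wF`, `index()`, and a stricter exact comparison on the first column.

```shell
#!/bin/sh
# Sketch only: made-up sample lines, not taken from the real pileup file.
sample=$(mktemp)
{
    printf 'chr1\t100\tA\t1\t.\tF\n'
    printf 'chrM\t6432\tC\t1\t^~.\tB\n'
    printf 'chrMT_alt\t5\tG\t1\t.\tB\n'   # hypothetical contig name
    printf 'chr2\t7\tT\t1\t.\tchrM\n'     # contrived: "chrM" in the last column
} > "$sample"

# Whole-word fixed-string match anywhere on the line.
n_grep=$(grep -cwF 'chrM' "$sample")
# Substring match anywhere on the line.
n_index=$(awk 'index($0, "chrM")' "$sample" | wc -l)
# Exact match on the first tab-separated column only.
n_field=$(awk -F'\t' '$1 == "chrM"' "$sample" | wc -l)

echo "grep -wF: $n_grep  index(): $n_index  first column: $n_field"
rm -f "$sample"
```

On this sample the three counts differ, so if only rows whose chromosome column is exactly chrM are wanted, the first-column comparison is the safest of the three.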

Answer 3

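I wonder whether avoiding the regular expression would be more successful:

awk '$1 == "chrM"' file1.pileup > file1.chrm.pileup

I also wonder whether your file is "corrupted", with a single line somewhere in it that is 137438953474 bytes long. Could you try this?

awk '{print NR, NF, length($0)}' file1.pileup > file1.line_lengths

and see where it blows up?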
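One caveat is that awk has to read a whole record before it can report its length, so the diagnostic may fail at the same line with the same allocation error. A possible complementary check, sketched below on a tiny made-up file, counts newlines and NUL bytes with `tr` and `wc`, which stream the input and should not need to hold a full line in memory. The simulated run of NUL bytes is an assumption about what the damage might look like, not something confirmed by the question.

```shell
#!/bin/sh
# Sketch only: a tiny made-up file with NUL bytes standing in for a damaged region.
damaged=$(mktemp)
printf 'chr1\t258755\tT\t1\t.\tF\n'  >  "$damaged"
printf '\000\000\000\000\000'        >> "$damaged"   # simulated corruption
printf 'chrM\t6432\tC\t1\t^~.\tB\n'  >> "$damaged"

total_bytes=$(wc -c < "$damaged")
# Keep only newline characters and count them: the number of line terminators.
newlines=$(tr -cd '\n' < "$damaged" | wc -c)
# Keep only NUL bytes and count them: a text pileup should contain none.
nul_bytes=$(tr -cd '\000' < "$damaged" | wc -c)

echo "bytes=$total_bytes newlines=$newlines nul=$nul_bytes"
rm -f "$damaged"
```

On the real file, a nonzero NUL count, or far fewer newlines than the file size and typical row length would suggest, would point toward a damaged stretch rather than a limit in awk itself.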

Trying to awk a file, but it cannot allocate enough memory. Is there an alternative or a tweak?
