Extracting data from a large txt file and writing selected fields to a csv

Time: 2018-11-07 18:09:49

Tags: python pandas csv

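I have a very long txt file that was produced by another script. I want to search it for selected pieces of information and write them to a cleaner .csv file.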

My current output looks like this (excerpt):

>Klondike/2018_36/110218_new/anacapa/QC/fastq/MM12-112-pcr-mamm-S51-L001_2.fastq
>Running on 1 core
>Trimming 16 adapters with at most 30.0% errors in paired-end mode ...
>Finished in 7.48 s (260 us/read; 0.23 M reads/minute).

>=== Summary ===

>Total read pairs processed:             28,794
>  Read 1 with adapter:                  28,248 (98.1%)
>  Read 2 with adapter:                   3,232 (11.2%)
>Pairs written (passing filters):        28,794 (100.0%)

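I want to extract the sample name (the part after the last / and before .fastq), the total read pairs processed, and the total pairs written, and build a csv file from them. The problem is that not every sample has any reads, and some sample names appear more than once. I have written regex patterns that pull out the three values I need, but I am having trouble turning those matches into a CSV and inserting None when a sample has no reads.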
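A minimal sketch of pulling the three values out of a single report block with capture groups instead of fixed-width lookbehinds. The names `NAME_RE`, `TOTAL_RE`, and `WRITTEN_RE` are hypothetical, and the `-S<n>-L<n>_<n>.fastq` suffix is an assumption based on the two file names shown in this question:

```python
import re

# Patterns assume the cutadapt report layout shown in the excerpt above.
# The "-S<n>-L<n>_<n>.fastq" suffix is an assumption about the file naming.
NAME_RE = re.compile(r"([^/\s]+)-S\d+-L\d+_\d+\.fastq")
# \s* absorbs the variable padding, so the exact number of spaces does not matter.
TOTAL_RE = re.compile(r"Total read pairs processed:\s*([\d,]+)")
WRITTEN_RE = re.compile(r"Pairs written \(passing filters\):\s*([\d,]+)")

block = """\
Klondike/2018_36/110218_new/anacapa/QC/fastq/MM12-112-pcr-mamm-S51-L001_2.fastq
Running on 1 core
=== Summary ===
Total read pairs processed:             28,794
  Read 1 with adapter:                  28,248 (98.1%)
  Read 2 with adapter:                   3,232 (11.2%)
Pairs written (passing filters):        28,794 (100.0%)
"""

name = NAME_RE.search(block).group(1)
# Drop the thousands separator so the counts become plain integers.
total = int(TOTAL_RE.search(block).group(1).replace(",", ""))
written = int(WRITTEN_RE.search(block).group(1).replace(",", ""))
print(name, total, written)
```

Converting the counts to integers also avoids carrying a comma inside a value that will later be written to a comma-separated file.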

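When the script encounters a block like the following, I need to keep the sample name and enter 0, None, NA, or something similar, rather than dropping the entry.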

>Klondike/2018_36/110218_new/anacapa/QC/fastq/MM12-112-pcr-beet-S106-L001_2.fastq
>Running on 1 core
>Trimming 16 adapters with at most 30.0% errors in paired-end mode ...
>No reads processed! Either your input file is empty or you used the wrong -f/--format parameter.
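One way to keep such a sample while recording a missing value is to test the match object before converting it. This is only a sketch; `count_or_none` is a hypothetical helper name, and the pattern assumes the summary line format shown earlier:

```python
import re

TOTAL_RE = re.compile(r"Total read pairs processed:\s*([\d,]+)")

def count_or_none(pattern, text):
    """Return the captured count as an int, or None if the line is absent."""
    match = pattern.search(text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))

empty_block = """\
Klondike/2018_36/110218_new/anacapa/QC/fastq/MM12-112-pcr-beet-S106-L001_2.fastq
Running on 1 core
Trimming 16 adapters with at most 30.0% errors in paired-end mode ...
No reads processed! Either your input file is empty or you used the wrong -f/--format parameter.
"""

# The block has no summary line, so the helper reports a missing value.
print(count_or_none(TOTAL_RE, empty_block))
```

Unlike `re.findall`, which returns an empty list that is easy to lose track of, `search` returning None makes the "no reads" case explicit.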

Here is what I have so far. I tried storing the results in a named tuple, and I may try a dictionary next, but I am lost and do not know where to go from here.

import pandas as pd
import re
import collections
from pathlib import Path

data = Path("cutadapt-report.txt").read_text()
split_data = data.split("Command line parameters: -e .3 -f fastq -g file:")


Cleaned = collections.namedtuple('Cleaned', 'Sample_Name Total_Read_Pairs_Processed Pairs_Written')

def clean_adapt(filename):
    try:
        data = Path(filename).read_text()
    except FileNotFoundError:
        return 'Having trouble locating that file, please try again'
    split_data = data.split("Command line parameters: -e .3 -f fastq -g file:")
    pattern_pairs = r"(?<=Total read pairs processed:             ) *\d+,?\d+"  
    pattern_name = r"((?<=QC\/fastq\/)\S+(?=-S))(?!.*\1)"
    pattern_written = r"(?<=Pairs written \(passing filters\):        ) *\d+,?\d+"
    lines = re.findall(pattern_name, data)   
    pp = []
    wr = []
    for entry in split_data:
        ok = re.findall(pattern_pairs, entry)
        # search this entry only, not the string form of the whole list
        writ = re.findall(pattern_written, entry)
        pp.append(ok)
        wr.append(writ)

    print(lines)
#    return Cleaned(lines, pp, wr)

clean_adapt("cutadapt-report.txt")
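One possible direction, offered as a sketch rather than a definitive fix: parse each section of the report into one record, so that the name and both counts stay together and a section without reads still yields a row. The function name `parse_report`, the helper `_count`, the field names, and the abbreviated `demo` text are all hypothetical. Using only the first file name found in each section is an assumption about how the repeated sample names arise.

```python
import re
from collections import namedtuple

Cleaned = namedtuple("Cleaned", "sample_name total_read_pairs pairs_written")

NAME_RE = re.compile(r"([^/\s]+)-S\d+-L\d+_\d+\.fastq")
TOTAL_RE = re.compile(r"Total read pairs processed:\s*([\d,]+)")
WRITTEN_RE = re.compile(r"Pairs written \(passing filters\):\s*([\d,]+)")

def _count(pattern, text):
    """Return the captured count as an int, or None when the line is missing."""
    match = pattern.search(text)
    return int(match.group(1).replace(",", "")) if match else None

def parse_report(text, marker="Command line parameters:"):
    """Return one Cleaned record per report section, keeping empty samples."""
    records = []
    for section in text.split(marker):
        name = NAME_RE.search(section)  # first file name in the section only
        if name is None:  # e.g. any text before the first marker
            continue
        records.append(
            Cleaned(name.group(1), _count(TOTAL_RE, section), _count(WRITTEN_RE, section))
        )
    return records

# Abbreviated stand-in for the real report; the options are elided on purpose.
demo = (
    "Command line parameters: (options elided)\n"
    "Klondike/2018_36/110218_new/anacapa/QC/fastq/MM12-112-pcr-mamm-S51-L001_2.fastq\n"
    "Total read pairs processed:             28,794\n"
    "Pairs written (passing filters):        28,794 (100.0%)\n"
    "Command line parameters: (options elided)\n"
    "Klondike/2018_36/110218_new/anacapa/QC/fastq/MM12-112-pcr-beet-S106-L001_2.fastq\n"
    "No reads processed! Either your input file is empty.\n"
)

for record in parse_report(demo):
    print(record)
```

Because each record is built from a single section, the three columns cannot drift out of alignment the way three separately collected lists can.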

My CSV file should look like this:

 Sample ID, Total Read Pairs Processed, Pairs Written
    MM12-112-pcr-mamm, 28,794, 28,794
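A sketch of writing such rows with the standard `csv` module, assuming the counts have already been converted to integers and that missing values should appear as `NA` (both are assumptions, as are the name `rows_to_csv` and the illustrative second row). Note that a value such as 28,794 contains the delimiter itself, so it would either need quoting or, as here, have its thousands separator removed:

```python
import csv
import io

rows = [
    ("MM12-112-pcr-mamm", 28794, 28794),
    ("MM12-112-pcr-beet", None, None),  # sample with no reads
]

def rows_to_csv(rows, missing="NA"):
    """Render rows as CSV text, replacing None with a placeholder."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Sample ID", "Total Read Pairs Processed", "Pairs Written"])
    for row in rows:
        writer.writerow([missing if value is None else value for value in row])
    return buffer.getvalue()

print(rows_to_csv(rows))
```

To write to disk instead, the same writer can be given a file opened with `open(path, "w", newline="")`.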
