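I have a very long txt file produced by another script. I want to search it for selected pieces of information and write them to a cleaner .csv file.
My output currently looks like this (excerpt):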
>Klondike/2018_36/110218_new/anacapa/QC/fastq/MM12-112-pcr-mamm-S51-L001_2.fastq
>Running on 1 core
>Trimming 16 adapters with at most 30.0% errors in paired-end mode ...
>Finished in 7.48 s (260 us/read; 0.23 M reads/minute).
>=== Summary ===
>Total read pairs processed: 28,794
> Read 1 with adapter: 28,248 (98.1%)
> Read 2 with adapter: 3,232 (11.2%)
>Pairs written (passing filters): 28,794 (100.0%)
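I want to get the sample name (the part after the last / and before .fastq), the total read pairs processed, and the total pairs written, and build a csv file from them. The problem is that not all samples have any reads, and some sample names appear more than once. I have written regex patterns that pull out the three outputs I need, but I am having trouble turning those matches into a CSV and entering None when a sample has no reads.
When the script encounters something like the following, I need to keep the sample name and enter 0, None, NA, or something similar, rather than leave the entry out.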
>Klondike/2018_36/110218_new/anacapa/QC/fastq/MM12-112-pcr-beet-S106-L001_2.fastq
>Running on 1 core
>Trimming 16 adapters with at most 30.0% errors in paired-end mode ...
>No reads processed! Either your input file is empty or you used the wrong -f/--format parameter.
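One way to keep the empty samples is to parse the report one block at a time and let a missing match become None instead of being skipped. The following is only a minimal sketch: `parse_report` and `first_int` are hypothetical names, the delimiter is shortened from the one used in my split call, and the sample text is a stand-in stitched together from the two excerpts above (the exact text between the delimiter and the path is not shown here, so that part is an assumption).

```python
import re

# Sample name: last path component, up to the "-S<number>-L<number>_<n>.fastq" suffix.
NAME_RE = re.compile(r"([^/\s]+?)-S\d+-L\d+_\d+\.fastq")
PROCESSED_RE = re.compile(r"Total read pairs processed:\s*([\d,]+)")
WRITTEN_RE = re.compile(r"Pairs written \(passing filters\):\s*([\d,]+)")


def first_int(pattern, text):
    """Return the first captured count as an int, or None if the line is absent."""
    match = pattern.search(text)
    return int(match.group(1).replace(",", "")) if match else None


def parse_report(text, delimiter="Command line parameters:"):
    """Return one (name, processed, written) tuple per block, in file order."""
    records = []
    for block in text.split(delimiter):
        name = NAME_RE.search(block)
        if name is None:
            continue  # text before the first delimiter, or a block without a fastq path
        records.append(
            (name.group(1), first_int(PROCESSED_RE, block), first_int(WRITTEN_RE, block))
        )
    return records


# Stand-in report built from the two excerpts (assumed layout, not the real file).
sample = (
    "Command line parameters: Klondike/2018_36/110218_new/anacapa/QC/fastq/"
    "MM12-112-pcr-mamm-S51-L001_2.fastq\n"
    "Running on 1 core\n"
    "=== Summary ===\n"
    "Total read pairs processed: 28,794\n"
    "Pairs written (passing filters): 28,794 (100.0%)\n"
    "Command line parameters: Klondike/2018_36/110218_new/anacapa/QC/fastq/"
    "MM12-112-pcr-beet-S106-L001_2.fastq\n"
    "Running on 1 core\n"
    "No reads processed! Either your input file is empty or you used the wrong "
    "-f/--format parameter.\n"
)
records = parse_report(sample)
```

Because every block yields exactly one tuple, the name and its counts cannot drift out of step the way separate lists from `re.findall` can.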
This is what I have so far. I tried storing the results in a named tuple, and I may try a dictionary next, but I am lost and don't know where to go from here.
import pandas as pd
import re
import collections
from pathlib import Path

data = Path("cutadapt-report.txt").read_text()
split_data = data.split("Command line parameters: -e .3 -f fastq -g file:")
Cleaned = collections.namedtuple('Cleaned', 'Sample_Name Total_Read_Pairs_Processed Pairs_Written')

def clean_adapt(filename):
    try:
        data = Path(filename).read_text()
    except FileNotFoundError:
        return 'Having trouble locating that file, please try again'
    # Each chunk after the split should correspond to one cutadapt run
    split_data = data.split("Command line parameters: -e .3 -f fastq -g file:")
    pattern_pairs = r"(?<=Total read pairs processed: ) *\d+,?\d+"
    pattern_name = r"((?<=QC\/fastq\/)\S+(?=-S))(?!.*\1)"
    pattern_written = r"(?<=Pairs written \(passing filters\): ) *\d+,?\d+"
    lines = re.findall(pattern_name, data)
    pp = []
    wr = []
    for entry in split_data:
        ok = re.findall(pattern_pairs, entry)
        # Search the current chunk only, not the string form of the whole list
        writ = re.findall(pattern_written, entry)
        pp.append(ok)
        wr.append(writ)
    print(lines)
    # return Cleaned(lines, pp, wr)

clean_adapt("cutadapt-report.txt")
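On the named tuple idea: one option is to build one record per sample rather than a single tuple holding three lists. As a rough sketch (assuming Python 3.7 or later, where `namedtuple` accepts `defaults`), the two count fields can fall back to None so that a sample without reads still gets a record. The values below are taken from the excerpts above.

```python
from collections import namedtuple

# defaults apply to the rightmost fields, so both counts fall back to None
Cleaned = namedtuple(
    "Cleaned",
    ["Sample_Name", "Total_Read_Pairs_Processed", "Pairs_Written"],
    defaults=(None, None),
)

with_reads = Cleaned("MM12-112-pcr-mamm", 28794, 28794)
no_reads = Cleaned("MM12-112-pcr-beet")  # counts omitted, so they stay None

# _asdict() gives a mapping per record, convenient for building CSV rows
row = no_reads._asdict()
```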
My CSV file should look like this:
Sample ID, Total Read Pairs Processed, Pairs Written
MM12-112-pcr-mamm, 28,794, 28,794
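For the writing step, here is a hedged sketch using the standard `csv` module. It assumes each sample has already been reduced to a `(name, processed, written)` tuple with None for missing counts. `write_samples_csv` is a hypothetical helper, and keeping only the first occurrence of a repeated sample name is an assumption about how duplicates should be handled.

```python
import csv
import io


def write_samples_csv(records, out, missing="NA"):
    """Write one row per unique sample name; None counts are written as `missing`."""
    writer = csv.writer(out)
    writer.writerow(["Sample ID", "Total Read Pairs Processed", "Pairs Written"])
    seen = set()
    for name, processed, written in records:
        if name in seen:
            continue  # assumption: later repeats of a sample name are dropped
        seen.add(name)
        writer.writerow([
            name,
            missing if processed is None else processed,
            missing if written is None else written,
        ])


# Example rows based on the excerpts above; the third repeats the first sample.
records = [
    ("MM12-112-pcr-mamm", 28794, 28794),
    ("MM12-112-pcr-beet", None, None),
    ("MM12-112-pcr-mamm", 28794, 28794),
]
buffer = io.StringIO()
write_samples_csv(records, buffer)
csv_text = buffer.getvalue()
```

Storing the counts as plain integers (28794 rather than 28,794) keeps the thousands separator from being read as a column delimiter. To write to disk instead of a buffer, a file opened with `open(path, "w", newline="")` can be passed as `out`.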