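Split a string into separate strings/columns

Date: 2016-05-23 16:29:34

Tags: powershell

Total newbie here. I have what I think is a simple problem to solve, and it has completely stumped me.

I have a tab-delimited dataset: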

NS500418:110:H2VY7BGXX:4:21601:20699:7042  chrV    8256382 True    CATCTAAATTTTGTTAGGATG   chrV    8256540 True    GAATAATAGAAGAGGTACAGA   CATCTAAATTTTGTTAGGATGTTCTTCCTCGCCTTTTCTTTCTTAATTTAAGACGTCAAAAAGCAGCATATGACAGGGATTCTGGTATTCCAATGAGATCATTTTACCAATGACGAAAAAATACGTGAGGTGTTGCAAAATGACACAAAA  GAATAATAGAAGAGGTACAGAAAACGTTTGTGACGTGAAAAATGCTAAAAGCTCAAGCAATGGGTGGTCTTCTAGAACTCTGAAGAAACTGTGTTTTGTTTTCATGATCTCGGGATGCTTCAAAAACTGAAATGGGTGTCAAAGCAGGCC  CATCTAAATTTTGTTAGGATGTTCTTCCTCGCCTTTT   GAATAATAGAAGAGGTACAGAAAACGTTTGTGACGTGA  chrV    8256416 chrV    8256566
M03109:43:000000000-ACGWU:1:1102:11826:4015 chrIII  7513608 False   TCGTTTTTTGTTCTCTAACAC   chrX    15229802    False   TTTTAAGTACTACCTAAGAACC  TTCGCATGGATGTTTGATCCGAGAATTGGAGCTATTCTTATGCCAGTTAGTTTTTTTTCGTTTTTTGTTCTCTAACAC  ATTTTGTGAAGCAATTTGGCCTTTTTTTAGTTGATCTAATTATGCGTAAACACAATTTTTAAGTACTACCTAAGAACC  GTGTTAGAGAACAAAAAACGAAAAAAAACTAACTGGCATAAGAATAGCTCCAATTCTCGGATCGATCTAATTATGCGT  GGTTCTTAGGTAGTACTTAAAAATTGTGTTTACGCATAATTAGATCGATCCGAGAATTGGAGCTATTCTTATGCCAGT  chrIII  7513540 chrX    15229776
NS500418:110:H2VY7BGXX:4:11407:17860:12911  chrX    4775576 True    GGATAGTTTTAATTTTCTTGG   chrX    16142498    True    GAGTACTGCCGCGCGATCGAT   GGATAGTTTTAATTTTCTTGGATATTTTTAAATTCCGCTTAAAAACAACATTGTTAAGTCCGTTTTCACAGTTTGGAACTTTCTGTAAAATTGAGACTGGGAAAACTTAATGAAATAAAAGAATAGGTGCTCTTTACAAATTAAAAACAA  GAGTACTGCCGCGCGATCAATGATCTCCTTTTTGTTGGAGAAAAGATTGGAGATGACGTCTAGCGCAAGCTTTTGGCTTTCCGATTCAAGTTCTTGATCTGATAGTCTGGGAGCCTTGATTGGAGCAGCTGGGACTTTTGCAGGTTGGGA  GGATAGTTTTAATTTTCTTGGATATTTTTAAATTCCG   GAGTACTGCCGCGCGATCGATCTTAGAAATTAGTTAAA  chrX    4775610 chrX    16142526
NS500418:110:H2VY7BGXX:4:13612:12507:3869   chrX    11052325    False   GGTCCAGCAAAACGCAGTAAAC  chrI    14497739    True    GTGGTGGAGGAGGAACGAATG   TACTTAACCTTTGCTCCGCGGCAAAACATGATCATTTGTTCAAATAGACAATTTCGTTTTTTCTTTGACGATCAGAGTCAATGAAGTTATCTAAGGCAATCACAAAACATTTTTGAAAAGCAGCAACAGGTCCAGCAAAACGCAGTAAAC  GTGGTGGAGGAGGAACGAATGGTTGTGGTCCGGCGAGTGGGGCCACTTGTGGCACAAAAGCTTGATGTCGGAGCAGATTTGGGGCGATCCCGTCTCGATGCTCGCCCACTCGGCAAAGGCGTTGATTCGGCTGGAACAACAAGCGTCTTC  GTTTACTGCGTTTTGCTGGACCTGTTGCAGCTTTTCAA  GTGGTGGAGGAGGAACGAATGGTTGTGGTCCGGCGAGT  chrX    11052290    chrI    14497765
NS500418:110:H2VY7BGXX:3:11604:7974:16095   chrX    7483102 False   CTAGTTCAATGAGGTATGTCAT  chrX    5875247 False   AAAAAACTGATGGTCTTATAT   CTTGGCTCAAATAAAACTGAAATCGAAAATAAAGTTTTGCATGTAAATACATTTTCAGAGTGCCTACGACTATTACCATCGAGATCGACGCGAATATAGTGTACCCTGCTTTCCTCGTTCTCGCCAACCTAGTTCAATGAGGTATGTCAT  TCACAGCCACCGGATATTCTGAGATGCTTCTTTTTTTGTTGTTGTCGTTAGATGTACAGTGCCATTCCGCATATCATTGATGTTAGGATCATCTAGCATCTACCAGAATTTTTCCTTTCTCTGAATTCTAAAAAACTGATGGTCTTATAT  ATGACATACCTCATTGAACTAGGTTGGCGAGAACGAGG  ATATAAGACCATCAGTTTTTTAGAATTCAGAGAAAGGA  chrX    7483067 chrX    5875222
NS500418:110:H2VY7BGXX:1:12207:12144:18475  chrI    11267978    True    TTTTTAGGCAGTATTCTGTGAA  chrI    7633132 True    GTTTTTAAGGTTTTCATCGAT   TTTTTAGGCAGTATTCTGTGAACTTTCCTGCATAGTTTCCACTATGATCACCATTTTTCTAGCTCTCCTGGTTCTCACTACAAGTCCTGGACAAGTCGAGGTAAGGCTGTTTAGCCTAACCGGCCCAATGGGCCCTGCTAGGCCTCACAG  GTTTTTAAGGTTTTCATCGATTTTAATTAAATTTTTATTCCAGGATGCACCAGGAAGTGAATTCAATATGCAACAGATGACATCAATGCACGACGATTCGACAACATTCACGAATCCAGTGTATGAATTAGAAGATGTTGATATGTCATC  TTTTTAGGCAGTATTCTGTGAACTTTCCTGCATAGTTT  GTTTTTAAGGTTTTCATCGATTTTAATTAAATTTTTAT  chrI    11268013    chrI    7633159
NS500418:152:H25C7AFXX:3:11408:4830:8603    chrIV   2481023 False   TGAATCATATCAGGGCAGCTG   chrIV   2542156 False   CGTTGCTTGCAGTGTTCCCTT   GAATTTAAATTTCCTAGTGAAAAATGACAAAAAATTATGTTTTTGTAAAAAATATCTCGAAAAAATGTTTTTTTTTTCTTTTTTTCACCTAAAATTTTTTTGTTTCAGAATTTTGTGGGTGTTGATCTATGAATCATATCAGGTCAGCTG  TGAAAAAAAAAATTTGCCAAAAAAGATCAAAGAGGCGCCGCCGACAGAGAAGTGCACATGAATTATATTCAGCTGGAAATTGGAAACTGAGAGAAATCTGAATAAAACATAATTTTTTTCTCTTATTTCCGTTGCTTGCAGTGTTCCCTT  CAGCTGCCCTGATATGATTCATAGAGATCAAAGAGGCGCCGCCGACAGAGAAGTGCACATGAATTATATTCAGCTGGAAATTGGAAACTGAGAGAAATCTGAATACAACATAATTTTTTTCTCTTATTTCCGTTGCTTGCAGTGTTCCCTT chrIV   2480995 chrIV   2542026
which I pass through:

gc GSM2041038_n2_adults_dpn.TSV |
    sls -Pattern '(chrIV.*chrIV.*chrIV.*chrIV)' |
    Export-Csv OnlyChrIV.tsv -Delimiter "`t"

to get (what I assume is) a tab-delimited file with headers. The result looks like this:

#TYPE Selected.System.Management.Automation.PSCustomObject
"IgnoreCase"    "LineNumber"    "Line"  "Filename"  "Path"  "Pattern"   "Context"   "Matches"
"True"  "32"    "NS500418:152:H25C7AFXX:3:11408:4830:8603   chrIV   2481023  False  TGAATCATATCAGGGCAGCTG   chrIV   2542156 False   CGTTGCTTGCAGTGTTCCCTT   GAATTTAAATTTCCTAGTGAAAAATGACAAAAAATTATGTTTTTGTAAAAAATATCTCGAAAAAATGTTTTTTTTTTCTTTTTTTCACCTAAAATTTTTTTGTTTCAGAATTTTGTGGGTGTTGATCTATGAATCATATCAGGTCAGCTG  TGAAAAAAAAAATTTGCCAAAAAAGATCAAAGAGGCGCCGCCGACAGAGAAGTGCACATGAATTATATTCAGCTGGAAATTGGAAACTGAGAGAAATCTGAATAAAACATAATTTTTTTCTCTTATTTCCGTTGCTTGCAGTGTTCCCTT  CAGCTGCCCTGATATGATTCATAGAGATCAAAGAGGCGCCGCCGACAGAGAAGTGCACATGAATTATATTCAGCTGGAAATTGGAAACTGAGAGAAATCTGAATACAACATAATTTTTTTCTCTTATTTCCGTTGCTTGCAGTGTTCCCTT chrIV   2480995 chrIV   2542026"    "InputStream"   "InputStream"   "(chrIV.*chrIV.*chrIV.*chrIV)"  ""  "System.Text.RegularExpressions.Match[]"

The data I want is in the "Line" column, so I pass this file through:

Import-Csv OnlyChrIV.tsv -Delimiter "`t" |
    select "line" |
    Export-Csv OnlyChrIV_OnlyLine.tsv -Delimiter "`t"

and I get this:

#TYPE Selected.System.Management.Automation.PSCustomObject
"Line"
"NS500418:152:H25C7AFXX:3:11408:4830:8603   chrIV   2481023 False   TGAATCATATCAGGGCAGCTG   chrIV   2542156 False   CGTTGCTTGCAGTGTTCCCTT   GAATTTAAATTTCCTAGTGAAAAATGACAAAAAATTATGTTTTTGTAAAAAATATCTCGAAAAAATGTTTTTTTTTTCTTTTTTTCACCTAAAATTTTTTTGTTTCAGAATTTTGTGGGTGTTGATCTATGAATCATATCAGGTCAGCTG  TGAAAAAAAAAATTTGCCAAAAAAGATCAAAGAGGCGCCGCCGACAGAGAAGTGCACATGAATTATATTCAGCTGGAAATTGGAAACTGAGAGAAATCTGAATAAAACATAATTTTTTTCTCTTATTTCCGTTGCTTGCAGTGTTCCCTT  CAGCTGCCCTGATATGATTCATAGAGATCAAAGAGGCGCCGCCGACAGAGAAGTGCACATGAATTATATTCAGCTGGAAATTGGAAACTGAGAGAAATCTGAATACAACATAATTTTTTTCTCTTATTTCCGTTGCTTGCAGTGTTCCCTT chrIV   2480995 chrIV   2542026"

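My problem is that I now can't split the string back into its original columns, which I need to do so that I can add headers and process the data further from there.

What I want (the way the data was originally formatted):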

"NS500418:152:H25C7AFXX:3:11408:4830:8603" "chrIV" "2481023" "False"   "TGAATCATATCAGGGCAGCTG" "chrIV" "2542156"

"NS500418:152:H25C7AFXX:3:11408:4830:8603" 
"chrIV"
"2481023"
"False"
"TGAATCATATCAGGGCAGCTG"
"chrIV"
"2542156"

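I have tried splitting, but that outputs a new line for each tab, as in the second example above (one field per line). I also don't know whether importing and/or exporting is the approach I should be using here.

This also needs to be done for a range of rows. For clarity, I have used only one row as an example here.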

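1 Answer:

Answer 0: (score: 3)

Don't use Select-String to filter the data. Use Import-Csv to import the file. If your file doesn't have a header row, you can specify your own headers via the -Header parameter: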

$inFile  = 'GSM2041038_n2_adults_dpn.TSV'
$outFile = 'OnlyChrIV.tsv'

# One header name per column of the input file
$headers = 'H1', 'H2', ...

# Keep only the rows whose four chromosome columns are all chrIV
Import-Csv $inFile -Delimiter "`t" -Header $headers | Where-Object {
    $_.H2 -eq 'chrIV' -and
    $_.H6 -eq 'chrIV' -and
    $_.H14 -eq 'chrIV' -and
    $_.H16 -eq 'chrIV'
} | Export-Csv $outFile -Delimiter "`t" -NoTypeInformation
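
To try the idea without touching the real file, here is a minimal sketch that parses tab-delimited strings into columns with ConvertFrom-Csv, which handles strings the same way Import-Csv handles a file. The two sample rows are made up (shortened sequences, hypothetical read names), and the count of 17 columns named H1 to H17 is an assumption based on counting the fields in the first sample row of the question; adjust it to your actual data.

```powershell
# Hypothetical demo data: two 17-column, tab-delimited rows with shortened sequences.
$tab  = "`t"
$rows = @(
    (@('read1', 'chrIV', '2481023', 'False', 'TGAA', 'chrIV', '2542156', 'False', 'CGTT',
       'GAAT', 'TGAA', 'CAGC', 'AAGG', 'chrIV', '2480995', 'chrIV', '2542026') -join $tab),
    (@('read2', 'chrV', '8256382', 'True', 'CATC', 'chrV', '8256540', 'True', 'GAAT',
       'CATC', 'GAAT', 'CATC', 'GAAT', 'chrV', '8256416', 'chrV', '8256566') -join $tab)
)

# Assumption: 17 columns, named H1..H17 to match the placeholder names used above.
$headers = 1..17 | ForEach-Object { "H$_" }

# Turn each string into an object with one property per column,
# then keep only rows where all four chromosome columns are chrIV.
$onlyChrIV = @(
    $rows |
        ConvertFrom-Csv -Delimiter $tab -Header $headers |
        Where-Object {
            $_.H2  -eq 'chrIV' -and
            $_.H6  -eq 'chrIV' -and
            $_.H14 -eq 'chrIV' -and
            $_.H16 -eq 'chrIV'
        }
)

# Show a few of the parsed columns for the surviving rows.
$onlyChrIV | Select-Object H1, H2, H3
```

If the rows have already been filtered with Select-String, the same parsing should apply to the raw text of each match: pipe the matches through ForEach-Object { $_.Line } before ConvertFrom-Csv instead of exporting the match objects themselves.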