I have a file that looks like this, containing a large amount of data:
>ENSMUSG00000020333|ENSMUST00000000145|Acsl6
AGCTCCAGGAGGGCCCGTCTCAGTCCGATGAACTTTGCAGCAATATTATAGTTATTCGTG
GTTCACAGAATTCCATTAAACATAAAGAAAAAACATAA
>ENSMUSG00000000001|ENSMUST00000000001|Gnai3
GAGGATGGCATAGTAAAAGCTATTACAGGGAGGAGTGTTGAGACCAGATGTCATCTACTG
CTCTGTAATCTAATGTTTAGGGCATATTGAAGTTGAGGTGCTGCCTTCCAGAACTTAAAC
The columns should be transformed so that each row always contains:
ENSMUSG*** ENSMUST*** GeneName Sequence (four separate columns)
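The Sequence column should hold the lines that start with A, C, G, or T, fused into a single text cell; the number of cells to fuse varies from gene to gene.
Does anyone have a suggestion for how to solve this?
Thanks a lot for your help! Best wishes, KK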
Answer 0: (score: 1)
Use the Text to Columns button on the Data tab. Select Delimited, click Next, then select Other and type the pipe symbol | in the box. Then click Next and Finish.
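For a large file it may be handier to run the same steps from a macro. The following is a minimal sketch, assuming the data sits in column A of the active sheet and that the split fields should be written from column B onward so column A is left intact (the sub name and the destination are assumptions, not part of the answer). It uses the Range.TextToColumns method with a custom delimiter.

```vba
Option Explicit

' Hypothetical macro: applies Text to Columns with "|" as the delimiter.
Sub SplitHeadersOnPipe()
    Dim rng As Range

    ' Used part of column A, from A1 down to the last filled cell
    With ActiveSheet
        Set rng = .Range(.Cells(1, 1), .Cells(.Rows.Count, 1).End(xlUp))
    End With

    ' Other:=True with OtherChar:="|" matches choosing "Other" and typing
    ' the pipe symbol in the dialog
    rng.TextToColumns Destination:=rng.Cells(1, 1).Offset(0, 1), _
        DataType:=xlDelimited, _
        Tab:=False, Semicolon:=False, Comma:=False, Space:=False, _
        Other:=True, OtherChar:="|"
End Sub
```

Note that this only splits the header lines into three fields. The sequence lines contain no pipe, so they stay in their own rows and still have to be merged separately.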
Answer 1: (score: 0)
I think only those with an Office 365 subscription have the worksheet function CONCAT, which could be useful in this situation. So I would do this with a VBA macro.
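- Each header line is split on the | delimiter.
- The records are gathered in a Collection object.
- Since you have a large database, the "work" is done in VBA arrays, as this processes much faster.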
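To make the first two points concrete, here is a small stand-alone sketch using one header line from the question. The sub name, the variable names, and the placeholder sequence are made up for illustration.

```vba
Option Explicit

' Hypothetical demo: split one header on "|" and store the record in a Collection.
Sub DemoSplitAndCollect()
    Dim records As Collection
    Dim fields As Variant, rec As Variant
    Dim header As String

    header = ">ENSMUSG00000020333|ENSMUST00000000145|Acsl6"

    ' Split returns a zero-based array with one element per field;
    ' fields(0) still carries the leading ">" from the header line
    fields = Split(header, "|")

    Set records = New Collection
    ' One record = three header fields plus a sequence ("AGCT" is a placeholder)
    records.Add Array(fields(0), fields(1), fields(2), "AGCT")

    ' Read the stored records back and print them to the Immediate window
    For Each rec In records
        Debug.Print rec(0), rec(1), rec(2), rec(3)
    Next rec
End Sub
```

A Collection is convenient here because the number of records is not known until the whole column has been scanned. The full macro further down applies the same idea to every record.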
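This assumes your data is in column A, starting at A1, and that your results will be written to columns B:E.
If your database is clean and formatted as you show, it should work fine. If it deviates from the format you provided, you may need to add some error checking.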
Option Explicit

Sub Organize()
    Dim COL As Collection
    Dim vSrc As Variant, vRes As Variant
    Dim WS As Worksheet, rRes As Range
    Dim W As Variant, S As String
    Dim I As Long, J As Long

    Set WS = ActiveSheet
    With WS
        Set rRes = .Cells(1, 2)
        'Read column A into a 2D array (assumes at least two rows of data)
        vSrc = .Range(.Cells(1, 1), .Cells(.Rows.Count, 1).End(xlUp))
    End With

    Set COL = New Collection
    For J = 1 To UBound(vSrc, 1)
        ReDim vRes(0 To 3)

        'Header line: drop the leading ">" and split on the pipe
        S = vSrc(J, 1)
        If Left(S, 1) = ">" Then S = Mid(S, 2)
        W = Split(S, "|")
        For I = 0 To 2
            vRes(I) = W(I)
        Next I

        S = ""
        'Concatenate subsequent lines
        'Could look for the ">" but OP gave specific starting letters
        ' So will use that
        'The J < UBound test keeps vSrc(J + 1, 1) in bounds when the
        ' last line is a header with no sequence after it
        Do While J < UBound(vSrc, 1)
            Select Case Left(vSrc(J + 1, 1), 1)
                Case "A", "C", "G", "T"
                    S = S & vSrc(J + 1, 1)
                    J = J + 1
                Case Else
                    Exit Do
            End Select
        Loop

        vRes(3) = S
        COL.Add vRes
    Next J

    'Copy the collected records into a 2D array for a single write
    ReDim vRes(1 To COL.Count, 1 To 4)
    I = 0
    For Each W In COL
        I = I + 1
        For J = 1 To 4
            vRes(I, J) = W(J - 1)
        Next J
    Next W

    Set rRes = rRes.Resize(rowsize:=UBound(vRes, 1), columnsize:=UBound(vRes, 2))
    With rRes
        .EntireColumn.Clear
        .Value = vRes
        .EntireColumn.AutoFit
    End With
End Sub
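
The comment in the macro mentions the alternative of looking for the header marker instead of the starting letters. That can matter if a sequence line ever starts with something other than A, C, G, or T (for example N for an unknown base, or lowercase letters). Below is a minimal sketch of that variant; the helper name IsHeaderLine is hypothetical and not part of the answer.

```vba
Option Explicit

' Hypothetical helper: True when a line is a header, i.e. it starts with ">".
Function IsHeaderLine(ByVal lineText As String) As Boolean
    IsHeaderLine = (Left$(lineText, 1) = ">")
End Function

' Quick check in the Immediate window
Sub DemoIsHeaderLine()
    Debug.Print IsHeaderLine(">ENSMUSG00000000001|ENSMUST00000000001|Gnai3") ' a header line
    Debug.Print IsHeaderLine("GAGGATGGCATAGTAAAAGCTATTACAGGG")                ' a sequence line
    Debug.Print IsHeaderLine("NNNNGAGGATGG")                                  ' sequence starting with N
End Sub
```

To use it, the Select Case block in the inner loop of Organize would be replaced by a test such as If IsHeaderLine(vSrc(J + 1, 1)) Then Exit Do, followed by appending the line to S. Every line that is not a header is then treated as sequence, so blank or stray lines would need their own check.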