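I have a text file containing many DNA sequences, one per line, each 20 bases long. I want to read the file into a DataFrame with each base in its own column, without using a for loop or any other operation that iterates over the whole file in Python, because the file is large.
I tried using "" as the delimiter, but that only results in the whole line being treated as a single column. I also tried "." and "\w", and neither gave me what I want.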
For example, for a file with the following contents:
ACGT
CGTA
GTAC
TACG
the DataFrame should look like this:
A C G T
C G T A
G T A C
T A C G
Answer 0 (score: 5)
You can read each line in as a single column and split it afterwards:
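import pandas as pd

# Contents of the input file (the name "sequences.csv" is a placeholder):
# ATGC
# CTAG
df = pd.read_csv("sequences.csv", header=None)
# df now holds one string column named 0:
#       0
# 0  ATGC
# 1  CTAG

# Splitting on the empty string yields one column per character,
# plus an empty column at each end.
df[0].str.split('', expand=True)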
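Output:
   0  1  2  3  4  5
0     A  T  G  C
1     C  T  A  G
This shows two extra empty columns, one at the front (column 0) and one at the back (column 5). You can easily drop them, for example: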
df[0].str.split('', expand=True).iloc[:,1:-1]
which gives:
   1  2  3  4
0  A  T  G  C
1  C  T  A  G
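If the empty edge columns are a nuisance, the same read-then-split idea can be sketched with a NumPy fixed-width view instead of `str.split`. This is a swapped-in technique, not the method from the answer above, and it assumes every sequence has exactly the same length. The inline sample data and the names `sample`, `n`, `fixed`, and `bases` are made up for illustration.

```python
import io

import numpy as np
import pandas as pd

# Inline sample standing in for the real file (hypothetical data).
sample = io.StringIO("ATGC\nCTAG\n")
df = pd.read_csv(sample, header=None)

n = 4  # assumed fixed sequence length (it would be 20 in the question)

# Build a contiguous fixed-width unicode array, then reinterpret each
# n-character string as n single-character strings.
# Note: strings longer than n would be silently truncated here.
fixed = np.asarray(df[0].tolist(), dtype=f"U{n}")
bases = pd.DataFrame(fixed.view("U1").reshape(len(fixed), n))
print(bases)
```

Because the view only reinterprets the existing buffer, no per-character splitting happens in Python, and the result already has exactly `n` columns with nothing to trim.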
Answer 1 (score: 3)
You can do this with pandas.read_fwf instead of pandas.read_csv.
If your file is named "dna.txt" and looks like this:
ACGT
CGTA
GTAC
TACG
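you can do the following:
import pandas as pd

# Treat every line as four fixed-width fields of one character each.
df = pd.read_fwf("dna.txt", header=None, widths=[1] * 4)
print(df)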
This prints:
   0  1  2  3
0  A  C  G  T
1  C  G  T  A
2  G  T  A  C
3  T  A  C  G
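For the 20-base sequences described in the question, the same call should work with `widths=[1] * 20`. Below is a minimal self-contained sketch. It reads from an in-memory buffer rather than a file so it can be run as is; the three sequences and the names `rows`, `sample`, and `seq_len` are made up for illustration.

```python
import io

import pandas as pd

seq_len = 20  # sequence length stated in the question

# Hypothetical sample data: three made-up 20-base sequences.
rows = ["ACGT" * 5, "CGTA" * 5, "GTAC" * 5]
sample = io.StringIO("\n".join(rows) + "\n")

# One fixed-width field of width 1 per base.
df = pd.read_fwf(sample, header=None, widths=[1] * seq_len)
print(df.shape)
```

With a real file, the buffer would be replaced by the file path, as in the answer above.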