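I am trying to read a .txt file containing string entries using pandas. Different lines in this file have different numbers of columns. The file can be found here.
This is how I tried to read the file: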
pd.read_csv('file.txt', sep=r'\s+', header=None).values[:,1:].astype('<U100')
Reading the file this way raises the following error:
ParserError: Error tokenizing data. C error: Expected 82 fields in line 4, saw 85
I read this Stack Overflow post, and then tried this approach:
pd.read_csv('file.txt', error_bad_lines=False, sep=r'\s+', header=None).values[:,1:].astype('<U100')
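The approach above raises no error, but it now skips many lines while reading the file. Is there a way to read the file above correctly (all lines) without errors?
Answer 0 (score: 1)
The call below discards a large amount of data (from 695 rows down to 475), and the file is still messy. It is best to preprocess it before bringing it into Python.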
[ins] In [20]: df = pd.read_csv("/tmp/file.txt", delim_whitespace=True, error_bad_lines=False, warn_bad_lines=False, header=None)
[ins] In [21]: df.shape
Out[21]: (474, 82)
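On recent pandas versions the flags used above are no longer available: `error_bad_lines` and `warn_bad_lines` were deprecated in pandas 1.3 and removed in 2.0, and `delim_whitespace` has been deprecated since 2.2. A minimal sketch of the equivalent call, assuming pandas 1.3 or later and using a small made-up sample in place of file.txt:

```python
import io

import pandas as pd

# Hypothetical stand-in for file.txt: the third line has one field too many.
sample = "a b c\nd e f\ng h i j\nk l m\n"

# on_bad_lines="skip" silently drops lines with more fields than expected,
# which is what error_bad_lines=False plus warn_bad_lines=False used to do.
# sep=r"\s+" replaces delim_whitespace=True.
df = pd.read_csv(
    io.StringIO(sample),
    sep=r"\s+",
    header=None,
    on_bad_lines="skip",
)

print(df.shape)
```

As with the original call, this still throws away every over-long line, so it only makes sense if losing those rows is acceptable.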
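Answer 1 (score: 1)
You can use the readlines() method of _io.TextIOWrapper to build a nested list of strings from the file (one sublist per line of the file). That is all pandas needs to construct a DataFrame: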
import pandas as pd

with open('file.txt', 'r') as f:
    file_lines = f.readlines()

# One sublist per line; shorter rows are padded with None
keymap = pd.DataFrame([string.split('\t') for string in file_lines])
This produces:
>>> keymap
0 1 2 3 4 5 6 \
0 TF: onecut2 ttc14 zadh2 pygm tiparp mgat4a man2a1
1 ppi_28 cep135 zranb1 strn stk24 strn3 fgfr1op2
2 ppi_29 hspb1 rps6ka5 mbp mapk13 mapkapk3 mapk11
3 TF: pou2af1 slc25a12 zbtb25 unk aif1 tmem54 apaf1
4 TF: rara kcnk4 gfer trip10 cog6 srebf1 zgpat
5 ppi_25 upf1 upf3a rbm8a xrn1 upf2 smg1
6 ppi_26 eif4g3 eif4e eif4a1 snora81 snord2 eif4a2
7 TF: rarb kcnk4 gfer trip10 cog6 srebf1 zgpat
8 ppi_20 traf3 nfatc2ip cd40 traf2 traf1 ltbr
9 ppi_21 bmp2 acvr2a bmp7 acvr2b bmp6 bmpr2
10 TF: rarg kcnk4 gfer trip10 cog6 srebf1 zgpat
11 ppi_23 tgif2 rbbp8 rnf8 mre11a nbn recql5
12 TF: pou5f1 slc25a12 zbtb25 unk aif1 tmem54 apaf1
13 TF: apc rab34 lsm3 calm2 rbl1 gapdh prkce
14 TF: elf2 sdccag8 pbxip1 ctsw slc35f2 rara fermt3
15 TF: elf4 fermt3 tmem204 s100a4 ager ptpn6 kdm6b
16 ppi_24 hspa1b hspa1a sox9 dnajc3 apaf1 brsk1
17 ppi_148 drg1 ncapg2 tal1 lyl1 ncapg ncapd2
18 TF: topors cnpy4 rcn3 rtn2 abi2 kcnd1 lmnb1
19 ppi_146 upf1 upf3a rbm8a xrn1 upf2 smg1
20 ppi_147 ube2v1 ube2v2 tyms zranb2 atp6v1b2 sssca1
21 ppi_144 srebf2 tada2b insig2 srebf1 klf13 zbtb7c
22 ppi_145 mthfr naa38 dhx16 lsm1 pyroxd1 lsm2
23 ppi_142 ntrk1 sgsm3 rasgrf1 bdnf kidins220 ntrk2
24 ppi_143 copb2 arcn1 arl1 copg2 copa tapbp
25 ppi_140 rap2a rap2b ralgds pik3ca rap1a rapgef5
26 ppi_141 cxcl10 irf1 irf5 irf3 irf7 stat2
27 ppi_204 mir196b pbx2 pknox1 pbx1 meis2 meis1
28 ppi_27 acvr1 bmp2 bmp7 smad1 btg2 smad6
29 TF: stat6 rhoc rdh5 pbxip1 ctsw rxrb mitd1
.. ... ... ... ... ... ... ...
666 TF: smad4 ndufs8 ahdc1 tpp1 cables1 rxrb acy1
667 TF: smad5 ahdc1 acy1 rara tctex1d4 wnt10b tmem204
668 TF: gata4 zbtb25 id2 sdhd ube2b ahdc1 arl6ip5
669 TF: hsf2 cbx4 ppm1l celsr3 hoxa7 kdm6b fli1
670 TF: gata2 zbtb25 id2 arl4a dctn3 ube2b arl6ip5
671 TF: smad1 ahdc1 rxrb acy1 rara tctex1d4 wnt10b
672 TF: smad2 ahdc1 rxrb acy1 rara tctex1d4 wnt10b
673 TF: gata1 mefv dnajb2 pck2 zbtb25 rac2 id2
674 TF: nr1h4 exd1 epha1 c1qtnf6 gfer ulk3 rxrb
675 TF: rxrg kcnk4 gfer trip10 cog6 srebf1 zgpat
676 TF: rxra nol7 exd1 hspbp1 kcnk4 arhgef37 epha1
677 TF: rxrb kcnk4 arhgef37 gfer baiap3 trip10 cog6
678 TF: nr1h3 hspbp1 kcnk4 rdh5 kars trip10 cog6
679 TF: ascl1 jmjd8 zc3h12a ptprcap ube2j2 tmem204 slc34a3
680 TF: rest acd lhx3 gripap1 l1cam hhatl ptprcap
681 TF: nfic eif4g3 il10rb gfer nyx arl6ip5 mettl10
682 TF: crem pitpna acd gfer fam131a tpp1 fscn1
683 ppi_208 hist1h4c hist1h4f hist1h4d hist1h4k hist1h4j hist1h4i
684 TF: arntl acy1 lrrc56 tmem204 zzz3 cirbp fasn
685 TF: nhlh1 smad6 brsk2 fam131a idi1 f2rl1 ap4b1
686 TF: myf6 jmjd8 zc3h12a ptprcap ube2j2 tmem204 slc34a3
687 TF: stat5b rdh5 ada sdccag8 gpr182 casp2 ctsw
688 TF: stat5a rdh12 ttc32 rdh5 ada pbxip1 tbx6
689 TF: maz jmjd8 ahdc1 rxrb rara slc34a3 cldn6
690 TF: brca1 ahdc1 gps2 tctex1d4 cirbp cbx4 ptpn6
691 TF: hes1 tcf3 polr2l lrrc56 tmem204 nck1 zfyve9
692 TF: crx trip10 fam131a rxrb ovol1 nfkbib mrpl24
693 TF: hand1 slc34a3 cirbp ptpn6 fasn kdm6b zbtb7b
694 TF: hand2 slc34a3 cirbp ptpn6 fasn kdm6b zbtb7b
695 TF: maf dnmt3a clcf1 acy1 tctex1d4 gapdh plekhh3
7 8 9 ... 770 771 772 773 \
0 zswim5 tubd1 igf2bp3 ... None None None None
1 sike1 cttnbp2 slmap ... None None None None
2 pla2g4a atf2 mapkapk5 ... None None None None
3 dok2 fam60a rab4b ... None None None None
4 rxrb clcf1 fyttd1 ... None None None None
5 parn edc4 dcp2 ... None None None None
6 mknk1 pdcd4 mknk2 ... None None None None
7 rxrb clcf1 fyttd1 ... None None None None
8 traf5 tnfrsf17 tnfrsf18 ... None None None None
9 bmpr1a bmpr1b gdf9 ... None None None None
10 rxrb clcf1 fyttd1 ... None None None None
11 rrm2b fancd2 dclre1c ... None None None None
12 dok2 fam60a rab4b ... None None None None
13 rrm1 irf4 actr1b ... None None None None
14 wnt10b tmem204 s100a4 ... None None None None
15 zbtb7b rnf167 ppp1ca ... None None None None
16 mos snrk hsbp1 ... None None None None
17 ncapd3 smc2 lmo1 ... None None None None
18 agfg1 gtf2a1l cbwd1 ... None None None None
19 parn slbp dcp2 ... None None None None
20 trip6 uchl3 usp9x ... None None None None
21 sec24b scap rnf139 ... None None None None
22 lsm3 wdr44 echdc2 ... None None None None
23 dok5 ngfr shc2 ... None None None None
24 copz2 sacm1l copz1 ... None None None None
25 rapgef6 mras rasip1 ... None None None None
26 pmaip1 mafb irf9 ... None None None None
27 hoxd9 hoxa9 hoxb1 ... None None None None
28 bmpr2 zeb1 smad7 ... None None None None
29 zadh2 snx13 cfl1 ... None None None None
.. ... ... ... ... ... ... ... ...
666 rara tctex1d4 wnt10b ... None None None None
667 slc34a3 grk6 kdm6b ... None None None None
668 rara timm8b daam1 ... None None None None
669 taf10 armc5 zhx2 ... None None None None
670 rxrb mrpl49 tctex1d4 ... None None None None
671 polr2l tmem204 slc34a3 ... None None None None
672 polr2l tmem204 slc34a3 ... None None None None
673 trip10 mxd3 arl4a ... None None None None
674 tpi1 rara gapdh ... None None None None
675 rxrb clcf1 fyttd1 ... None None None None
676 c1qtnf6 gfer rdh5 ... mapkapk2 ptch1 creb3l4 rpl23a
677 srebf1 zgpat rxrb ... None None None None
678 srebf1 col7a1 tekt4 ... None None None None
679 cirbp ptpn6 fasn ... None None None None
680 ppa1 gpr6 syt6 ... None None None None
681 rara gapdh atg9a ... None None None None
682 pafah1b1 mlf2 wnt10b ... None None None None
683 hist1h4h hist1h4b hist1h3c ... None None None None
684 kdm6b cpsf3l pprc1 ... None None None None
685 zfyve9 slc34a3 syt6 ... None None None None
686 cirbp ptpn6 fasn ... None None None None
687 gmfg vps53 ptpn6 ... None None None None
688 casp2 cxcr2 ctsw ... None None None None
689 cbx4 thoc6 isyna1 ... None None None None
690 isyna1 rnf44 hoxa7 ... None None None None
691 slc34a3 cirbp cbx4 ... None None None None
692 cnot4 fbxl19 zbtb7b ... None None None None
693 pkn1 nr1d1 map2k3 ... None None None None
694 pkn1 nr1d1 map2k3 ... None None None None
695 klc1 il7r kdm6b ... None None None None
774 775 776 777 778 779
0 None None None None None None
1 None None None None None None
2 None None None None None None
3 None None None None None None
4 None None None None None None
5 None None None None None None
6 None None None None None None
7 None None None None None None
8 None None None None None None
9 None None None None None None
10 None None None None None None
11 None None None None None None
12 None None None None None None
13 None None None None None None
14 None None None None None None
15 None None None None None None
16 None None None None None None
17 None None None None None None
18 None None None None None None
19 None None None None None None
20 None None None None None None
21 None None None None None None
22 None None None None None None
23 None None None None None None
24 None None None None None None
25 None None None None None None
26 None None None None None None
27 None None None None None None
28 None None None None None None
29 None None None None None None
.. ... ... ... ... ... ...
666 None None None None None None
667 None None None None None None
668 None None None None None None
669 None None None None None None
670 None None None None None None
671 None None None None None None
672 None None None None None None
673 None None None None None None
674 None None None None None None
675 None None None None None None
676 npff prkcdbp tmem25 bcl9l ap2b1 klf15\n
677 None None None None None None
678 None None None None None None
679 None None None None None None
680 None None None None None None
681 None None None None None None
682 None None None None None None
683 None None None None None None
684 None None None None None None
685 None None None None None None
686 None None None None None None
687 None None None None None None
688 None None None None None None
689 None None None None None None
690 None None None None None None
691 None None None None None None
692 None None None None None None
693 None None None None None None
694 None None None None None None
695 None None None None None None
[696 rows x 780 columns]
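Note the trailing `\n` in the last cell of row 676 in the output above: `readlines()` keeps line endings, and splitting on `'\t'` leaves the newline attached to the final field of every line. A possible variant is sketched below. It assumes that no field contains spaces, so `str.split()` with no argument can be used to split on any whitespace and drop the line ending. The helper name `read_ragged` and the sample rows are made up for illustration.

```python
import io

import pandas as pd


def read_ragged(handle):
    """Build a DataFrame from lines with differing numbers of fields."""
    # str.split() with no argument splits on any run of whitespace and
    # discards the trailing newline; blank lines are skipped.
    rows = [line.split() for line in handle if line.strip()]
    # Shorter rows are padded with None, as in the output above.
    return pd.DataFrame(rows)


# Hypothetical three-line sample in the same layout as file.txt.
sample = io.StringIO(
    "TF:\tonecut2\tttc14\n"
    "ppi_28\tcep135\n"
    "TF:\trara\tkcnk4\tgfer\n"
)
keymap = read_ragged(sample)

# Drop the label column and convert to fixed-width strings, as in the
# question; missing cells become empty strings instead of 'None'.
genes = keymap.iloc[:, 1:].fillna("").to_numpy().astype("<U100")
print(genes.shape)
```

With a real file, the same helper could be called as `read_ragged(open('file.txt'))` inside a `with` block.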
I hope this helps! Best,
D.