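Select specific rows and cells in a text file and put them into a data frame: Python or R

Time: 2017-02-28 16:09:04

Tags: python r pandas dataframe data.table

Either Python or R can be used for this, but can someone tell me how to select the "Basic Stats" rows of a text file that looks like the one below? I would like to put this information, together with the name of each ROI, into a pandas data frame or a data table in R.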


The final output should look like this:

ROI: mrc_ranch_house [Red] 195 points

Basic Stats        Min       Max         Mean      Stdev
     Band 1 -20.208261  6.025762    -8.866403   5.289712

Histogram           DN     Npts   Total  Percent     Acc Pct
Band 1      -20.208261        1       1   0.5128      0.5128
Bin=0.10287 -20.105383        0       1   0.0000      0.5128
            -20.002504        1       2   0.5128      1.0256
            -19.899626        0       2   0.0000      1.0256
            -19.796747        0       2   0.0000      1.0256
            -19.693869        0       2   0.0000      1.0256
            -19.590990        0       2   0.0000      1.0256
            -19.488112        0       2   0.0000      1.0256

Stats for ROI: river_1 [Blue] 90 points                     
Basic Stats        Min        Max         Mean     Stdev        
     Band 1 -20.187374  -6.694543   -12.227586  2.66464     

Histogram           DN     Npts   Total  Percent     Acc Pct    
Band 1      -20.187374  1   1   1.1111  1.1111  
Bin=0.05291 -20.134461  0   1   0   1.1111  
        -20.081548  0   1   0   1.1111  
        -20.028635  0   1   0   1.1111  
        -19.975722  0   1   0   1.1111  


Stats for ROI: river_2 [Blue] 96 points                 
Basic Stats        Min        Max         Mean     Stdev    
     Band 1 -18.365091  -5.820825   -13.164463  2.851231    

 Histogram              DN     Npts   Total  Percent     Acc Pct
 Band 1         -18.365091  1   1   1.0417  1.0417
 Bin=0.04919    -18.315898  0   1   0   1.0417
        -18.266705  0   1   0   1.0417
        -18.217512  0   1   0   1.0417

...and so on.

Thanks!

3 answers:

Answer 0 (score: 4)

In R, you can use:

# read the text file
txt <- readLines('https://dl.dropboxusercontent.com/u/45095175/rois_all.txt')

# create an index for the lines that are needed
ti <- rep(which(grepl('ROI:', txt)), each = 3) + 1:3
# create a grouping vector of the same length
grp <- rep(1:33, each = 3)

# filter the text with the index 'ti' 
# and split into a list with grouping variable 'grp'
lst <- split(txt[ti], grp)
# loop over the list and read the text parts in as dataframes
lst <- lapply(lst, function(x) read.table(text = x, sep = '\t', header = TRUE,
                                          blank.lines.skip = TRUE))

# bind the dataframes in the list together in one data.frame
DF <- do.call(rbind, lst)
# change the name of the first column
names(DF)[1] <- 'ROI'

# get the correct ROI's for the ROI-column
DF$ROI <- sub('.*: (\\w+).*$', '\\1', txt[grepl('ROI: ', txt)])

which gives:

> DF
                ROI        Min        Max       Mean    Stdev
1   mrc_ranch_house -20.208261   6.025762  -8.866403 5.289712
2           river_1 -20.187374  -6.694543 -12.227586 2.664640
3           river_2 -18.365091  -5.820825 -13.164463 2.851231
4           river_3 -18.291010  -4.583666 -12.092995 3.479293
5           river_4 -17.074295  -4.926921  -9.970926 2.897855
6           river_5 -16.849176  -8.622208 -12.387085 2.168462
7  adjacent_river_2 -18.987597  -7.957749 -13.392523 1.962263
8  adjacent_river_3 -19.426531  -8.640042 -13.467425 1.888105
9  adjacent_river_4 -20.452566  -6.830183 -12.833450 2.124761
10           bcs_1_ -23.612043  -8.221417 -16.032305 2.080695
11           bcs_2_ -24.018219 -10.648975 -16.814048 1.948863
12           bcs_3_ -23.011086  -9.106754 -15.404174 1.867498
13           red_1_ -22.313442  -7.839107 -14.768196 2.134152
14           red_2_ -22.551537  -7.236300 -14.613618 2.204253
15           red_3_ -22.057703  -7.746992 -14.483161 2.123497
16            bcs_4 -22.705107  -8.972753 -15.201623 1.817122
17            bcs_5 -24.109459 -10.113716 -15.776537 1.849163
18         glade_1_ -19.913187  -6.189866 -12.695884 3.303929
19         glade_2_ -19.812855  -4.672865 -11.995191 4.840168
20         glade_3_ -10.078033  -2.828722  -5.877417 1.941401
21           mwea_b -13.979379  -4.977155 -11.392434 2.019037
22             kaga -13.114172  -8.889531 -10.649324 1.290551
23             huku -14.206743  -7.853305 -10.608210 1.441250
24             ruai -18.643108 -12.645180 -14.540123 1.224183
25          tumaini -19.543234 -13.164941 -15.899968 1.812876
26           nkando -19.973492  -7.040238 -11.716987 2.617544
27           jikaze -16.408030  -9.001065 -12.323898 1.942196
28        miarage_b -15.126486  -6.661448 -10.391111 1.764279
29           batian -15.269146  -9.603316 -11.962470 1.168859
30         gitaraga -17.037708  -7.495215 -10.886802 2.561877
31       wiumiririe  -9.578024  -6.225223  -7.688715 1.059796
32           chumvi -14.883148 -10.327570 -12.819469 1.231636
33 next_to_airstrip -17.242777  -5.207252 -10.601750 1.987712

The last part (from binding the list together into one data frame onwards) can also be done with the rbindlist function from the data.table package:

# load the 'data.table' package for the 'rbindlist' function
library(data.table)
# bind the dataframes in the list together to a data.table (enhanced version of a data.frame)
DT <- rbindlist(lst)
# change the name of the first column
setnames(DT, 1, 'ROI')

# get the correct ROI's for the ROI-column
DT[, ROI := sub('.*: (\\w+).*$', '\\1', txt[grepl('ROI: ', txt)])]
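
For readers working in Python, the ROI-name extraction that sub() performs above can be sketched with the standard re module. This is only a minimal sketch and not part of the original answer: the header_lines sample and the roi_name helper are made up for illustration, and the pattern assumes that ROI names consist of word characters only.

```python
import re

# Hypothetical sample of header lines, copied from the excerpt in the question.
header_lines = [
    "ROI: mrc_ranch_house [Red] 195 points",
    "Stats for ROI: river_1 [Blue] 90 points",
    "Stats for ROI: river_2 [Blue] 96 points",
]

# Capture the run of word characters that follows "ROI: ".
ROI_PATTERN = re.compile(r"ROI: (\w+)")


def roi_name(line):
    """Return the ROI name found in a header line, or None if there is none."""
    match = ROI_PATTERN.search(line)
    return match.group(1) if match else None


roi_names = [roi_name(line) for line in header_lines]
print(roi_names)
```

Using search rather than a whole-line substitution means that lines without an ROI header give None instead of being passed through unchanged.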

Answer 1 (score: 2)

Here is another, uglier solution. The result is a good old regular data.frame.

rois_all <- file("https://dl.dropboxusercontent.com/u/45095175/rois_all.txt")

xy <- readLines(rois_all)

# find lines where ROI starts
roin <- grep(pattern = "ROI: ", x = xy)
roi <- xy[roin]
roi <- gsub(".*ROI: (\\w+).*$", "\\1", roi)

# find lines with stats
stats <- grep(pattern = "Basic Stats", x = xy)

# trim whitespace and collect the column names
cn <- trimws(sapply(strsplit(xy[stats][1], "\t"), "[", 2:5, simplify = FALSE)[[1]])

# split the stat line by \t and extract only elements 2 to 5. merge row-wise
out <- do.call(rbind, sapply(strsplit(xy[stats + 1], "\t"), "[", 2:5, simplify = FALSE))
out <- as.data.frame(apply(out, MARGIN = 2, as.numeric))

# add ROI column extracted earlier
out <- cbind(roi, out)

colnames(out) <- c("ROI", cn)

out

                ROI        Min        Max       Mean    Stdev
1   mrc_ranch_house -20.208261   6.025762  -8.866403 5.289712
2           river_1 -20.187374  -6.694543 -12.227586 2.664640
3           river_2 -18.365091  -5.820825 -13.164463 2.851231
4           river_3 -18.291010  -4.583666 -12.092995 3.479293
5           river_4 -17.074295  -4.926921  -9.970926 2.897855
6           river_5 -16.849176  -8.622208 -12.387085 2.168462
7  adjacent_river_2 -18.987597  -7.957749 -13.392523 1.962263
8  adjacent_river_3 -19.426531  -8.640042 -13.467425 1.888105
9  adjacent_river_4 -20.452566  -6.830183 -12.833450 2.124761
10           bcs_1_ -23.612043  -8.221417 -16.032305 2.080695
11           bcs_2_ -24.018219 -10.648975 -16.814048 1.948863
12           bcs_3_ -23.011086  -9.106754 -15.404174 1.867498
13           red_1_ -22.313442  -7.839107 -14.768196 2.134152
14           red_2_ -22.551537  -7.236300 -14.613618 2.204253
15           red_3_ -22.057703  -7.746992 -14.483161 2.123497
16            bcs_4 -22.705107  -8.972753 -15.201623 1.817122
17            bcs_5 -24.109459 -10.113716 -15.776537 1.849163
18         glade_1_ -19.913187  -6.189866 -12.695884 3.303929
19         glade_2_ -19.812855  -4.672865 -11.995191 4.840168
20         glade_3_ -10.078033  -2.828722  -5.877417 1.941401
21           mwea_b -13.979379  -4.977155 -11.392434 2.019037
22             kaga -13.114172  -8.889531 -10.649324 1.290551
23             huku -14.206743  -7.853305 -10.608210 1.441250
24             ruai -18.643108 -12.645180 -14.540123 1.224183
25          tumaini -19.543234 -13.164941 -15.899968 1.812876
26           nkando -19.973492  -7.040238 -11.716987 2.617544
27           jikaze -16.408030  -9.001065 -12.323898 1.942196
28        miarage_b -15.126486  -6.661448 -10.391111 1.764279
29           batian -15.269146  -9.603316 -11.962470 1.168859
30         gitaraga -17.037708  -7.495215 -10.886802 2.561877
31       wiumiririe  -9.578024  -6.225223  -7.688715 1.059796
32           chumvi -14.883148 -10.327570 -12.819469 1.231636
33 next_to_airstrip -17.242777  -5.207252 -10.601750 1.987712
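
The same "find the header, take the next line" idea might look roughly like this in plain Python. It is a sketch under assumptions: the lines list is a hypothetical tab-separated stand-in for the contents of rois_all.txt, and the variable names are invented for the example.

```python
# Hypothetical tab-separated excerpt standing in for the lines of rois_all.txt.
lines = [
    "Stats for ROI: river_1 [Blue] 90 points",
    "Basic Stats\tMin\tMax\tMean\tStdev",
    "     Band 1\t-20.187374\t-6.694543\t-12.227586\t2.66464",
    "",
    "Stats for ROI: river_2 [Blue] 96 points",
    "Basic Stats\tMin\tMax\tMean\tStdev",
    "     Band 1\t-18.365091\t-5.820825\t-13.164463\t2.851231",
]

# Indices of the "Basic Stats" header lines (the role grep() plays in the R code).
stats_idx = [i for i, line in enumerate(lines) if line.startswith("Basic Stats")]

# Column names come from fields 2 to 5 of the first header line.
columns = [name.strip() for name in lines[stats_idx[0]].split("\t")[1:5]]

# The values sit on the line right after each header; convert them to floats.
values = [[float(x) for x in lines[i + 1].split("\t")[1:5]] for i in stats_idx]

print(columns)
print(values)
```

Like the R version, this relies on the stats line always following its header directly and on the fields being separated by tabs.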

Answer 2 (score: 1)

I have not found a way to import the ROI names yet, since every row in data is called Band 1, but this is a good start.

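import pandas as pd

# on_bad_lines='skip' (pandas >= 1.3) replaces the removed error_bad_lines=False
data = pd.read_csv(r'rois_all.txt', delimiter='\t', on_bad_lines='skip', skiprows=[0, 1])
# drop the rows that do not have a value in every column
data = data.dropna()
# drop the repeated header rows; .loc replaces the removed .ix indexer
data = data.loc[data.loc[:, 'Basic Stats'] != 'Basic Stats', :]
data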

Sample output:

Basic Stats Min         Max         Mean        Stdev
0   Band 1  -20.208261  6.025762    -8.866403   5.289712
3   Band 1  -20.187374  -6.694543   -12.227586  2.664640
6   Band 1  -18.365091  -5.820825   -13.164463  2.851231

I have now extracted the names for all of the Basic Stats rows, as follows:

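names = pd.read_csv(r'rois_all.txt', delimiter='\t', on_bad_lines='skip', skiprows=[0, 1])

# keep only the "Stats for ROI: ..." rows (.loc replaces the removed .ix indexer)
names = names.loc[names.loc[:, 'Basic Stats'] != '     Band 1']
names = names.loc[names.loc[:, 'Basic Stats'] != 'Basic Stats']
# raw string avoids invalid-escape warnings; expand=False returns a Series
names = names.loc[:, 'Basic Stats'].str.extract(r'Stats for ROI: (.*) \[.*\] [0-9]*', expand=False)
# the first ROI header is in the skipped rows, so add its name by hand
names.loc[0] = 'mrc_ranch_house'
names = names.sort_index()
names = names.reset_index(drop=True)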

which looks like this:

0      mrc_ranch_house
1              river_1
2              river_2

Join data and names like this:

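# renumber the rows 0, 1, 2, ... so that they line up with the index of names
data = data.reset_index(drop=True)
data.loc[:, 'Basic Stats'] = names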

giving the desired result:

   Basic Stats      Min         Max         Mean        Stdev
0   mrc_ranch_house -20.208261  6.025762    -8.866403   5.289712
1   river_1         -20.187374  -6.694543   -12.227586  2.664640
2   river_2         -18.365091  -5.820825   -13.164463  2.851231
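
One way to sidestep the problem that every stats row is called Band 1 is to read the file line by line and remember the most recent ROI name, so that names and values are paired as they are read instead of being joined afterwards. The following is a minimal sketch, not the method of any answer above: parse_basic_stats and the sample text are hypothetical, and it assumes the values line comes right after the "Basic Stats" header with fields separated by tabs or spaces.

```python
import re


def parse_basic_stats(lines):
    """Collect one record per ROI: its name plus the Basic Stats values."""
    rows = []
    roi = None      # name from the most recent "ROI: ..." header line
    columns = None  # column names, set right after a "Basic Stats" header
    for line in lines:
        header = re.search(r"ROI: (\w+)", line)
        if header:
            roi = header.group(1)
        elif line.strip().startswith("Basic Stats"):
            # Header looks like "Basic Stats  Min  Max  Mean  Stdev".
            columns = line.split()[2:]
        elif columns is not None and line.strip():
            # First non-blank line after the header, e.g. "Band 1  -20.2  6.0 ...".
            numbers = [float(x) for x in line.split()[-len(columns):]]
            rows.append({"ROI": roi, **dict(zip(columns, numbers))})
            columns = None
    return rows


# Hypothetical sample, shortened from the excerpt in the question.
sample = """\
ROI: mrc_ranch_house [Red] 195 points
Basic Stats        Min       Max         Mean      Stdev
     Band 1 -20.208261  6.025762    -8.866403   5.289712

Histogram           DN     Npts   Total  Percent     Acc Pct
Band 1      -20.208261        1       1   0.5128      0.5128

Stats for ROI: river_1 [Blue] 90 points
Basic Stats        Min        Max         Mean     Stdev
     Band 1 -20.187374  -6.694543   -12.227586  2.66464
"""

rows = parse_basic_stats(sample.splitlines())
print(rows)
```

If pandas is available, passing the records to pd.DataFrame(rows) should give a data frame with the columns ROI, Min, Max, Mean and Stdev.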