Question

I have a large space-delimited .txt file (about 50 MB) with the structure shown below. I want to get rid of the first 8 space-delimited columns.

L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!
L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!
L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.
L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?
L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.
L924 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Wow
L872 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Okay -- you're gonna need to learn how to lie.

Desired output (as a .txt file):

They do not!
They do to!
I hope so.
She okay?
...

How can I do this in RMS, with Python 2.7 or 3.4 (please specify the version), or on the Linux command line? Thanks!

Answer 1

On my Linux system (Ubuntu 12.04), this works fine:

cut -f 9- -d " " tmp.tmp >newfile.out

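-f 9- selects field 9 and everything after it; -d " " sets the field delimiter to a single space.

My guess is that this is very fast (cut is a tool built for exactly this job). It could probably be done in a few lines of Python, but that might be somewhat slower (?); doing it in R would probably be slow/inefficient.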
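For comparison, the "few lines of Python" could look roughly like the sketch below. It imitates cut -d " " -f 9- by splitting each line on single spaces at most 8 times and keeping the remainder. The function name drop_leading_fields and its n parameter are hypothetical names chosen for this illustration; they do not come from the original answer, and no speed comparison has been made here.

```python
def drop_leading_fields(line, n=8):
    """Return the text after the first n single-space-delimited fields.

    Like cut -d " ", consecutive spaces count as empty fields.
    Unlike cut, a line containing no space at all yields '' here
    (cut would print such a line unchanged unless -s is given).
    """
    parts = line.rstrip('\n').split(' ', n)
    return parts[n] if len(parts) > n else ''


if __name__ == '__main__':
    sample = 'L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!'
    print(drop_leading_fields(sample))  # prints: They do not!
```

Because the split stops after 8 cuts, any spacing inside the remaining text is left exactly as it was in the input.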

Answer 2

An R approach:

txt <- "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!
L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!
L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.
L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?
L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.
L924 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Wow
L872 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Okay -- you're gonna need to learn how to lie."

txt_obj <- readLines(textConnection(txt))
# Delete eight leading groups of "run of non-spaces + one space" from each line.
txt8 <- gsub( "^(([^ ]+[ ]){8})", "", txt_obj)
txt8
#----------
[1] "They do not!"                                  
[2] "They do to!"                                   
[3] "I hope so."                                    
[4] "She okay?"                                     
[5] "Let's go."                                     
[6] "Wow"                                           
[7] "Okay -- you're gonna need to learn how to lie."
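The anchored pattern used here is not specific to R. As a rough sketch (not part of the original answer; strip_first_eight and LEADING_EIGHT are hypothetical names), the same expression can be applied with Python's re module:

```python
import re

# Eight repetitions of "run of non-space characters + one space",
# anchored at the start of the line.
LEADING_EIGHT = re.compile(r'^(?:[^ ]+ ){8}')


def strip_first_eight(line):
    """Remove the first eight space-terminated fields if all eight are present."""
    return LEADING_EIGHT.sub('', line)


print(strip_first_eight('L924 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Wow'))  # prints: Wow
```

A line with fewer than eight such fields does not match the pattern and is returned unchanged.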

Answer 3

This is easy to do with Python slicing:

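# Runs on both Python 2.7 and 3.4; streams the file instead of loading it all.
with open('in_file') as in_f, open('out_file', 'w') as out_f:
    for line in in_f:
        line = line.strip()
        if line:  # skip blank lines
            # Drop the first 8 whitespace-separated fields and keep the rest.
            out_f.write(' '.join(line.split()[8:]) + '\n')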

Answer 4

This deletes every character up to and including the last +++, along with the blanks that follow it:

sed 's/.*+++[[:blank:]]\+//' file
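A similar "cut at the last separator" idea can be sketched in Python with str.rpartition. This is an illustration only, assuming the separator always appears exactly as ' +++$+++ ' (as in the sample data); SEP and text_after_last_separator are hypothetical names.

```python
SEP = ' +++$+++ '  # field separator as it appears in the sample data


def text_after_last_separator(line):
    """Return what follows the last SEP, or the whole line if SEP is absent."""
    return line.rstrip('\n').rpartition(SEP)[2]


print(text_after_last_separator(
    'L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.'))  # prints: I hope so.
```

Like the sed command, this cuts at the last separator, so a message that itself contained the separator would be truncated. Splitting from the left with line.split(SEP, 4) might be safer in that case, assuming every line has exactly five fields.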
