Question

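I have a file containing codes and their descriptions. The code is always a short (3-6 characters) string of letters, separated by a space from the description that follows. The description is usually several words (which also contain spaces). Here is an example: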

LIISS License Issued
LIMOD License Modified
LIPASS License Assigned (Partial Assignment)
LIPND License Assigned (Partition/Disaggregation)
LIPPND License Issued from a Partial/P&D Assignment
LIPUR License Purged
LIREIN License Reinstated
LIREN License Renewed

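I would like to read this in as a two-column data frame, with the code in the first column and the description in the second. How can I do this in R?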

Answer 1

We can read this with readLines, then use sub to create the two columns and combine them with data.frame.

#read the lines with readLines
lines <- readLines('pavel.txt')
#match the first run of whitespace and everything after it;
#replace with `''` to keep only the code at the beginning of the line.
str1 <- sub('\\s+.*', '', lines)
#match the leading non-space characters (`^[^ ]+`) and the whitespace after them;
#replace with `''` to keep only the description that follows.
str2 <- sub('^[^ ]+\\s+', '', lines)
out <- data.frame(v1= str1, v2=str2, stringsAsFactors=FALSE)
head(out,3)
#      v1                                    v2
#1  LIISS                        License Issued
#2  LIMOD                      License Modified
#3 LIPASS License Assigned (Partial Assignment)

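Alternatively, after reading the dataset as a single column, we can use extract from library(tidyr). We use capture groups to pull out the characters needed for each column. Here ([^ ]+) matches one or more non-space characters and captures them with parentheses; it is followed by one or more spaces (which we drop), and then a second capture group extracts the rest of the characters.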

library(tidyr)
extract(read.table('pavel.txt', sep=','), V1, 
                 into= c('V1', 'V2'), '([^ ]+)\\s+(.*)')
#      V1                                           V2
#1  LIISS                               License Issued
#2  LIMOD                             License Modified
#3 LIPASS        License Assigned (Partial Assignment)
#4  LIPND  License Assigned (Partition/Disaggregation)
#5 LIPPND License Issued from a Partial/P&D Assignment
#6  LIPUR                               License Purged
#7 LIREIN                           License Reinstated
#8  LIREN                              License Renewed
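The same capture-group idea can also be sketched in base R with utils::strcapture (available in R 3.4.0 and later), which avoids the tidyr dependency. This is only an illustration: the three sample lines are copied from the question, and the column names code and description are placeholders chosen here, not names from the original answer.

```r
# A few sample lines from the question, inlined so the sketch is self-contained
lines <- c("LIISS License Issued",
           "LIPASS License Assigned (Partial Assignment)",
           "LIPPND License Issued from a Partial/P&D Assignment")

# proto defines the names and types of the output columns
# (the names `code` and `description` are illustrative)
proto <- data.frame(code = character(), description = character(),
                    stringsAsFactors = FALSE)

# group 1: leading non-space characters; group 2: everything after the whitespace
out <- strcapture("^([^ ]+)\\s+(.*)$", lines, proto = proto)
out
```

With a real file, lines would come from readLines('pavel.txt') instead of the inlined vector.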

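Or we can replace the first space with a `,` using sub, and then read the result with read.table (or read.csv with header=FALSE) and sep=','.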

read.table(text=sub(' ', ',', readLines('pavel.txt')), sep=',')
#      V1                                           V2
#1  LIISS                               License Issued
#2  LIMOD                             License Modified
#3 LIPASS        License Assigned (Partial Assignment)
#4  LIPND  License Assigned (Partition/Disaggregation)
#5 LIPPND License Issued from a Partial/P&D Assignment
#6  LIPUR                               License Purged
#7 LIREIN                           License Reinstated
#8  LIREN                              License Renewed
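One caveat with the comma approach is that it presumably breaks if a description itself contains a comma. A possible variation, sketched below, swaps the first space for a tab instead. The second sample line is a made-up description with a comma added purely to show the difference, and the column names are placeholders.

```r
# Sample lines; the second description is hypothetical and contains a comma
lines <- c("LIISS License Issued",
           "LIMOD License Modified, Amended")

# replace only the first space with a tab, then read tab-separated text;
# quote = "" and comment.char = "" keep quotes and '#' as literal characters
out <- read.table(text = sub(" ", "\t", lines, fixed = TRUE),
                  sep = "\t", quote = "", comment.char = "",
                  col.names = c("code", "description"),
                  stringsAsFactors = FALSE)
out
```

This assumes the descriptions never contain a tab; if they might, one of the regex-based approaches above is the safer choice.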

If we are on Linux, awk can be piped into either fread from data.table or read.csv/read.table.

library(data.table)
#recent data.table versions expect shell commands to be passed via `cmd=`
fread(cmd = "awk '{sub(\" \", \",\", $0)}1' pavel.txt", header=FALSE)
#      V1                                           V2
#1:  LIISS                               License Issued
#2:  LIMOD                             License Modified
#3: LIPASS        License Assigned (Partial Assignment)
#4:  LIPND  License Assigned (Partition/Disaggregation)
#5: LIPPND License Issued from a Partial/P&D Assignment
#6:  LIPUR                               License Purged
#7: LIREIN                           License Reinstated
#8:  LIREN                              License Renewed

Answer 2

You could use stri_split_fixed() from the stringi package.

library(stringi)
as.data.frame(stri_split_fixed(readLines("x.txt"), " ", n = 2, simplify = TRUE))
#       V1                                           V2
# 1  LIISS                               License Issued
# 2  LIMOD                             License Modified
# 3 LIPASS        License Assigned (Partial Assignment)
# 4  LIPND  License Assigned (Partition/Disaggregation)
# 5 LIPPND License Issued from a Partial/P&D Assignment
# 6  LIPUR                               License Purged
# 7 LIREIN                           License Reinstated
# 8  LIREN                              License Renewed

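Here we use readLines() to read the file (shown as "x.txt"). Then stri_split_fixed() says we want to split on a space and have n = 2 columns returned, so the split happens only at the first space. simplify = TRUE is used to return a matrix instead of a list.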
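For comparison, splitting only at the first space can be sketched in base R with regexpr() and substr(), without stringi. This is an illustrative alternative rather than part of the answer above; it assumes every line contains at least one space, and the column names are placeholders.

```r
# Sample lines from the question, inlined so the sketch is self-contained
lines <- c("LIISS License Issued",
           "LIPND License Assigned (Partition/Disaggregation)")

# position of the first literal space in each line (-1 if there is none)
pos <- as.integer(regexpr(" ", lines, fixed = TRUE))

# everything before the first space is the code, everything after is the description
out <- data.frame(code = substr(lines, 1, pos - 1),
                  description = substr(lines, pos + 1, nchar(lines)),
                  stringsAsFactors = FALSE)
out
```

Lines without a space give pos = -1 and would need separate handling before the substr() calls.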

Data: x.txt

writeLines("LIISS License Issued
LIMOD License Modified
LIPASS License Assigned (Partial Assignment)
LIPND License Assigned (Partition/Disaggregation)
LIPPND License Issued from a Partial/P&D Assignment
LIPUR License Purged
LIREIN License Reinstated
LIREN License Renewed", "x.txt")

R: read the first column, then the rest
