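Delimiter problem when importing data into R

Date: 2015-06-01 02:00:04

Tags: r dataframe delimiter data-import

I am trying to import a text file into R and put it into a data frame together with other data.

My delimiter is "|", and here is a sample of my data: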

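|Painless check-in. Two legs of 3 on AC: AC105, YYZ-YVR. Roomy and clean A321 with fantastic crew. AC33: YVR-SYD, very light load and had 3 seats to myself. A very warm and friendly crew as usual on this transpacific route that I take several times a year. Arrived 20 minutes ahead of schedule. The expected high level of service from our flag carrier, Air Canada. Altitude Elite member. |We recently returned from Dublin to Toronto, then on to Winnipeg. Other than cutting it close due to limited staffing in Toronto our flight was excellent. Due to the rush in Toronto one of our carry-ons went into the cargo hold. When we arrived in Winnipeg it had stayed in Toronto. They were most helpful and kind at the Winnipeg airport, and the next day we received 3 phone calls about the misplaced bag and it was delivered to our home. We are very thankful and appreciative of the service we received, a great end to a wonderful holiday. |Flew Toronto to Heathrow. Much worse flight than on the way out. We paid a hefty extra fee for exit seats which had no storage whatsoever, not even any room under the seats. Ridiculous. Crew were poor and unfriendly. One older male crew member had a real attitude and acted as though he was doing everyone a huge favour by serving them. A reasonable dinner, but breakfast was a piece of banana bread. That was it! The worst airline breakfast I have had.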

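As you can see, there are many "|" characters, but as the output below shows, when I import the data into R it is only separated once, instead of roughly 152 times.

How can I get each individual text into a separate column of the data frame? I want a data frame of length 152 rather than 2.

Edit: the line of code is: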

  myData <- read.table("C:/Users/Norbert/Desktop/research/Important files/Airline Reviews/Reviews/air_can_Review.txt", sep="|",quote=NULL, comment='',fill = TRUE, header=FALSE)

length(myData)
[1] 2
class(myData)
[1] "data.frame"
str(myData)
'data.frame':   1244 obs. of  2 variables:
 $ V1: Factor w/ 1093 levels "","'delayed' on departure (I reference flights between March 2014 and January 2015 in this regard: Denver, SFO,",..: 210 367    698 853 1 344 483 87 757 52 ...
 $ V2: Factor w/ 154 levels ""," hotel","5/9/2014, LHR to Vancouver, AC855. 23/9/2014, Vancouver to LHR, AC854. For Economy the leg room was OK compared to",..: 1 1 1 1 78 1 1 1 1 1 ...

myDataFrame <- data.frame(text = myData, otherVar2 = 1, otherVar2 = "blue", stringsAsFactors = FALSE)
str(myDataFrame)
'data.frame':   531 obs. of  3 variables:
 $ text       : chr  "BRU-YUL, May 26th, A330-300. Departed on-time, landed 30 minutes late due to strong winds, nice flight, food" "excellent, cabin-crew smiling and attentive except for one old lady throwing meal trays like boomerangs. Seat-" "pitch was very generous, comfortable seat,  IFE a bit outdated but selection was Okay. Air Canadas problem is\nthat the new pro"| __truncated__ "" ...
 $ otherVar2  : num  1 1 1 1 1 1 1 1 1 1 ...
 $ otherVar2.1: chr  "blue" "blue" "blue" "blue" ...

length(myDataFrame)
[1] 3

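1 answer:

Answer 0 (score: 1)

A better way to read this file is to use scan() and then put the result into a data frame together with your other variables (here I just made some up). Note that I pasted the text above into a file called sample.txt after removing the leading "|".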

# Read every "|"-separated field into a character vector
myData <- scan("sample.txt", what = "character", sep = "|")
# Length-one values (1 and "blue") are recycled to one per review
myDataFrame <- data.frame(text = myData, otherVar1 = 1, otherVar2 = "blue",
                          stringsAsFactors = FALSE)
str(myDataFrame)
## 'data.frame':    3 obs. of  3 variables:
##  $ text     : chr  "Painless check-in. Two legs of 3 on AC: AC105, YYZ-YVR. Roomy and clean A321 with fantastic crew. AC33: YVR-SYD, very light loa"| __truncated__ "We recently returned from Dublin to Toronto, then on to Winnipeg. Other than cutting it close due to limited staffing in Toront"| __truncated__ "Flew Toronto to Heathrow. Much worse flight than on the way out. We paid a hefty extra fee for exit seats which had no storage "| __truncated__
##  $ otherVar1: num  1 1 1
##  $ otherVar2: chr  "blue" "blue" "blue"
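
One caveat: scan() also ends a field at every line break (unless the field is quoted), so if the reviews in the real file are wrapped over several lines, each line becomes its own element. A minimal sketch of one way around that, assuming every review starts with "|" and the whole file fits in memory. The file contents and the names tmp, raw and reviews are made up for illustration:

```r
# Hypothetical sample: three reviews, each wrapped over two lines,
# every review starting with "|".
tmp <- tempfile(fileext = ".txt")
writeLines(c("|First review, line one",
             "continues on line two.",
             "|Second review, line one",
             "continues here.",
             "|Third review, line one",
             "ends here."), tmp)

# Read all lines, join them into one string, then split on the literal "|".
raw <- paste(readLines(tmp), collapse = " ")
reviews <- strsplit(raw, "|", fixed = TRUE)[[1]]

# Trim whitespace and drop the empty piece before the leading "|".
reviews <- trimws(reviews)
reviews <- reviews[nzchar(reviews)]

length(reviews)
## [1] 3
```

The resulting character vector can be passed to data.frame() in the same way as the output of scan().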

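otherVar1 and otherVar2 are just placeholders for your own variables, since you said you want a data.frame that contains other variables as well. I chose a numeric variable and a text variable. Because each is given as a single value, it is recycled across all observations in the data set (3 in the example).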
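
To make the recycling concrete, here is a small sketch with made-up stand-ins for the reviews. The rating column and its values are hypothetical and only show how a real per-review variable would differ from a recycled constant:

```r
# Made-up stand-ins for the scanned reviews.
reviews <- c("first review", "second review", "third review")

# Length-one values are recycled to match the three reviews.
df <- data.frame(text = reviews, otherVar1 = 1, otherVar2 = "blue",
                 stringsAsFactors = FALSE)

# A full-length vector gives each review its own value instead.
df$rating <- c(9, 8, 3)   # hypothetical ratings

dim(df)
## [1] 3 4
```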

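I realize your question asked how to put each text into a separate column, but that is not a good way to use a data.frame, because data.frames are designed to hold variables in columns. (With one text per column, you cannot add other variables.)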
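
As a rough illustration of why one review per row is more convenient, the sketch below derives new per-review variables from the text column. The review snippets and the column names n_chars and mentions_crew are made up for the example:

```r
# Shortened, made-up review texts, one per row.
reviews <- c("Painless check-in.",
             "Our flight was excellent.",
             "Crew were poor and unfriendly.")
df <- data.frame(text = reviews, stringsAsFactors = FALSE)

# New variables are simply new columns computed from the text.
df$n_chars <- nchar(df$text)
df$mentions_crew <- grepl("crew", df$text, ignore.case = TRUE)

# Rows can then be filtered on those variables.
subset(df, mentions_crew)
```

With one review per column, none of these column-wise operations would apply directly.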

If you really do want that, you have to coerce the data back to a data frame after transposing it, like this:

myDataFrame <- as.data.frame(t(data.frame(text = myData, stringsAsFactors = FALSE)), stringsAsFactors = FALSE)
str(myDataFrame)
## 'data.frame':    1 obs. of  3 variables:
##  $ V1: chr "Painless check-in. Two legs of 3 on AC: AC105, YYZ-YVR. Roomy and clean A321 with fantastic crew. AC33: YVR-SYD, very light loa"| __truncated__
##  $ V2: chr "We recently returned from Dublin to Toronto, then on to Winnipeg. Other than cutting it close due to limited staffing in Toront"| __truncated__
##  $ V3: chr "Flew Toronto to Heathrow. Much worse flight than on the way out. We paid a hefty extra fee for exit seats which had no storage "| __truncated__
length(myDataFrame)
## [1] 3
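
The coercion is needed because t() applied to a data frame returns a matrix. As an alternative sketch that avoids the transpose, the vector can be turned into a named list first, so that each element becomes its own column. The stand-in reviews and the names cols and wide are made up here:

```r
# Made-up stand-ins for the scanned reviews.
reviews <- c("first review", "second review", "third review")

# Name each element V1, V2, ... so each becomes its own column.
cols <- setNames(as.list(reviews), paste0("V", seq_along(reviews)))
wide <- as.data.frame(cols, stringsAsFactors = FALSE)

dim(wide)
## [1] 1 3
```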

"Was literally banana bread"? Definitely economy class.