I am trying to import a text file into R and put it in a data frame together with other data.
My delimiter is "|", and a sample of my data is here:
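|Painless check-in. Two legs of 3 on AC: AC105, YYZ-YVR. Roomy and clean A321 with fantastic crew. AC33: YVR-SYD,
very light load and had 3 seats to myself. A very warm and friendly crew as usual on this transpacific
route that I take several times a year. Arrived 20 minutes ahead of schedule. Expected high level of service from
our flagship carrier, Air Canada. Altitude Elite member.
|We recently returned from Dublin to Toronto, then on to Winnipeg. Other than cutting it close due to limited
staffing in Toronto our flight was excellent. Due to the rush in Toronto one of our carry-ons went into the
cargo hold. When we arrived in Winnipeg it had stayed in Toronto. They were most helpful and kind at the Winnipeg
airport, and the next day we received 3 phone calls about the misplaced bag, and it was delivered to
our home. We are very thankful and most appreciative of the service we received, a great end to a
wonderful holiday.
|Flew Toronto to Heathrow. Much worse flight than on the way out. We paid a hefty extra fee for exit seats which had
no storage whatsoever, not even any room under the seats. Ridiculous. Crew were poor, not friendly. One
older male staff member had a real attitude and acted as though he was doing everyone a huge favour by serving
them. A reasonable dinner, but breakfast was a piece of banana bread. That's it! The worst airline breakfast
I have had.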
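As you can see, there are many "|" characters, but as shown below, when I import the data into R it is split only once instead of about 152 times.
How can I get each individual text into a separate column of the data frame? I want a data frame of length 152, not 2.
Edit: the lines of code are: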
myData <- read.table("C:/Users/Norbert/Desktop/research/Important files/Airline Reviews/Reviews/air_can_Review.txt", sep="|",quote=NULL, comment='',fill = TRUE, header=FALSE)
length(myData)
[1] 2
class(myData)
[1] "data.frame"
str(myData)
'data.frame': 1244 obs. of 2 variables:
$ V1: Factor w/ 1093 levels "","'delayed' on departure (I reference flights between March 2014 and January 2015 in this regard: Denver, SFO,",..: 210 367 698 853 1 344 483 87 757 52 ...
$ V2: Factor w/ 154 levels ""," hotel","5/9/2014, LHR to Vancouver, AC855. 23/9/2014, Vancouver to LHR, AC854. For Economy the leg room was OK compared to",..: 1 1 1 1 78 1 1 1 1 1 ...
myDataFrame <- data.frame(text = myData, otherVar2 = 1, otherVar2 = "blue", stringsAsFactors = FALSE)
str(myDataFrame)
'data.frame': 531 obs. of 3 variables:
$ text : chr "BRU-YUL, May 26th, A330-300. Departed on-time, landed 30 minutes late due to strong winds, nice flight, food" "excellent, cabin-crew smiling and attentive except for one old lady throwing meal trays like boomerangs. Seat-" "pitch was very generous, comfortable seat, IFE a bit outdated but selection was Okay. Air Canadas problem is\nthat the new pro"| __truncated__ "" ...
$ otherVar2 : num 1 1 1 1 1 1 1 1 1 1 ...
$ otherVar2.1: chr "blue" "blue" "blue" "blue" ...
length(myDataFrame)
[1] 3
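Answer 0 (score: 1)
A better way to read this is to use scan(), and then put the result in a data frame with your other variables (here I just made some up). Note that I pasted the text above into a file named sample.txt after removing the leading "|".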
myData <- scan("sample.txt", what = "character", sep = "|")
myDataFrame <- data.frame(text = myData, otherVar2 = 1, otherVar2 = "blue",
stringsAsFactors = FALSE)
str(myDataFrame)
## 'data.frame': 3 obs. of 3 variables:
## $ text : chr "Painless check-in. Two legs of 3 on AC: AC105, YYZ-YVR. Roomy and clean A321 with fantastic crew. AC33: YVR-SYD, very light loa"| __truncated__ "We recently returned from Dublin to Toronto, then on to Winnipeg. Other than cutting it close due to limited staffing in Toront"| __truncated__ "Flew Toronto to Heathrow. Much worse flight than on the way out. We paid a hefty extra fee for exit seats which had no storage "| __truncated__
## $ otherVar2 : num 1 1 1
## $ otherVar2.1: Factor w/ 1 level "blue": 1 1 1
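otherVar1 and otherVar2 are just placeholders for your own variables, since you said you wanted a data.frame that includes other variables. (In the code above both were typed as otherVar2, which is why data.frame() renamed the second one to otherVar2.1.) I chose one numeric variable and one text variable; because each is given as a single value, it is recycled across all observations in the dataset (3 in the example).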
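One caveat: scan() treats the end of a line as a field boundary even when sep is set, so a review that is wrapped over several lines in the file comes back as several elements. If your real file is laid out like the sample in the question, a different approach may work better: read all lines, join them into one string, and split on "|". The following is only a minimal sketch of that idea, not something tested on your data. The temporary file, its shortened contents, and the names path, oneText and reviews are assumptions made up for illustration.

```r
# Hypothetical sample file: three reviews, each starting with "|" and
# wrapped over several lines, mimicking the layout in the question.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "|Painless check-in. Roomy and clean A321",
  "with fantastic crew.",
  "|We recently returned from Dublin to Toronto,",
  "then on to Winnipeg.",
  "|Flew Toronto to Heathrow. That's it!"
), path)

# Read every line, glue the lines back into one string, then split on "|".
# fixed = TRUE makes "|" a literal character rather than a regex operator.
lines   <- readLines(path)
oneText <- paste(lines, collapse = " ")
reviews <- strsplit(oneText, "|", fixed = TRUE)[[1]]

# Trim whitespace and drop the empty piece produced by the leading "|".
reviews <- trimws(reviews)
reviews <- reviews[nzchar(reviews)]

# One review per row; the two placeholder values are recycled.
myDataFrame <- data.frame(text = reviews, otherVar1 = 1, otherVar2 = "blue",
                          stringsAsFactors = FALSE)
str(myDataFrame)
```

Because the empty leading piece is dropped in code, the leading "|" does not have to be removed from the file by hand. Apostrophes in the text are not an issue here, since strsplit() does no quote processing.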
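I realize your question was how to put each text in a separate column, but that is not a good way to use a data.frame, because data.frames are designed to hold variables in columns. (With one text per column, you cannot add other variables.)
If you really do want that, you have to transpose the data and then coerce it back to a data.frame, like this: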
myDataFrame <- as.data.frame(t(data.frame(text = myData, stringsAsFactors = FALSE)), stringsAsFactors = FALSE)
str(myDataFrame)
## 'data.frame': 1 obs. of 3 variables:
## $ V1: chr "Painless check-in. Two legs of 3 on AC: AC105, YYZ-YVR. Roomy and clean A321 with fantastic crew. AC33: YVR-SYD, very light loa"| __truncated__
## $ V2: chr "We recently returned from Dublin to Toronto, then on to Winnipeg. Other than cutting it close due to limited staffing in Toront"| __truncated__
## $ V3: chr "Flew Toronto to Heathrow. Much worse flight than on the way out. We paid a hefty extra fee for exit seats which had no storage "| __truncated__
length(myDataFrame)
## [1] 3
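To make the difference between the two layouts concrete, here is a small sketch that builds both from the same vector. The three shortened review strings and the names reviews, long, wide and nChars are placeholders I made up. The wide version uses t() directly on the character vector, which is a slightly shorter variant of the call above, not a different result.

```r
# Hypothetical stand-in for the vector of reviews returned by scan().
reviews <- c("Painless check-in.",
             "We recently returned from Dublin.",
             "Flew Toronto to Heathrow.")

# Long layout: one review per row, so a per-review variable is just a column.
long <- data.frame(text = reviews, stringsAsFactors = FALSE)
long$nChars <- nchar(long$text)

# Wide layout: t() turns the vector into a 1 x n matrix, which becomes
# a single row with default column names V1, V2, ...
wide <- as.data.frame(t(reviews), stringsAsFactors = FALSE)

dim(long)  # rows = number of reviews
dim(wide)  # columns = number of reviews
```

In the long layout a new variable such as nChars lines up with its review automatically, while the wide layout has nowhere to put it.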
"A piece of banana bread"? Definitely economy class.