Question

I am trying to import a text file into R and put it into a data frame together with other data.

My delimiter is "|", and a sample of my data is here:

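|Painless check-in. Two legs of 3 on AC: AC105, YYZ-YVR. Roomy and clean A321 with fantastic crew. AC33: YVR-SYD, very light load and had 3 seats to myself. A very enthusiastic and friendly crew as usual on this transpacific route that I take several times a year. Arrived 20 min ahead of schedule. The expected high level of service from our flag carrier, Air Canada. Altitude Elite member. |We recently returned from Dublin to Toronto, then on to Winnipeg. Other than cutting it close due to limited staffing in Toronto our flight was excellent. Due to the rush in Toronto one of our carry ons went to the cargo hold. When we arrived in Winnipeg it had stayed in Toronto, they were most helpful and kind at the Winnipeg airport, and we received 3 phone calls the following day in regards to the misplaced bag and it was delivered to our home. We are very thankful and more than appreciative of the service we received, what a great end to a wonderful holiday. |Flew Toronto to Heathrow. Much worse flight than on the way out. We paid a hefty extra fee for exit seats which had no storage whatsoever, and not even any room under the seats. Ridiculous. Crew were poor, not friendly. One older male member of staff was quite attitudinal, acting as though he was doing everyone a huge favour by serving them. A reasonable dinner but breakfast was a piece of banana bread. That's it! The worst airline breakfast I have had.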

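As you can see, there are many "|" characters, but as the screenshot shows, when I import the data into R it is split only once instead of about 152 times.

How can I get each individual text into a separate column of the data frame? I want a data frame of length 152, not 2.

Edit: the line of code is: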

  myData <- read.table("C:/Users/Norbert/Desktop/research/Important files/Airline Reviews/Reviews/air_can_Review.txt", sep="|",quote=NULL, comment='',fill = TRUE, header=FALSE)

length(myData)
[1] 2
class(myData)
[1] "data.frame"
str(myData)
'data.frame':   1244 obs. of  2 variables:
 $ V1: Factor w/ 1093 levels "","'delayed' on departure (I reference flights between March 2014 and January 2015 in this regard: Denver, SFO,",..: 210 367    698 853 1 344 483 87 757 52 ...
 $ V2: Factor w/ 154 levels ""," hotel","5/9/2014, LHR to Vancouver, AC855. 23/9/2014, Vancouver to LHR, AC854. For Economy the leg room was OK compared to",..: 1 1 1 1 78 1 1 1 1 1 ...

myDataFrame <- data.frame(text = myData, otherVar2 = 1, otherVar2 = "blue", stringsAsFactors = FALSE)
str(myDataFrame)
'data.frame':   531 obs. of  3 variables:
 $ text       : chr  "BRU-YUL, May 26th, A330-300. Departed on-time, landed 30 minutes late due to strong winds, nice flight, food" "excellent, cabin-crew smiling and attentive except for one old lady throwing meal trays like boomerangs. Seat-" "pitch was very generous, comfortable seat,  IFE a bit outdated but selection was Okay. Air Canadas problem is\nthat the new pro"| __truncated__ "" ...
 $ otherVar2  : num  1 1 1 1 1 1 1 1 1 1 ...
 $ otherVar2.1: chr  "blue" "blue" "blue" "blue" ...

length(myDataFrame)
[1] 3

Answer 1

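A better way to read this is with scan(), then put the result into a data frame together with your other variables (here I just made some up). Note that I removed the leading "|" after pasting the text above into a file called sample.txt.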

myData <- scan("sample.txt", what = "character", sep = "|")
myDataFrame <- data.frame(text = myData, otherVar2 = 1, otherVar2 = "blue",
                          stringsAsFactors = FALSE)
str(myDataFrame)
## 'data.frame':    3 obs. of  3 variables:
##  $ text       : chr  "Painless check-in. Two legs of 3 on AC: AC105, YYZ-YVR. Roomy and clean A321 with fantastic crew. AC33: YVR-SYD, very light loa"| __truncated__ "We recently returned from Dublin to Toronto, then on to Winnipeg. Other than cutting it close due to limited staffing in Toront"| __truncated__ "Flew Toronto to Heathrow. Much worse flight than on the way out. We paid a hefty extra fee for exit seats which had no storage "| __truncated__
##  $ otherVar2  : num  1 1 1
##  $ otherVar2.1: Factor w/ 1 level "blue": 1 1 1

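otherVar1 and otherVar2 are just placeholders for your own variables, since you said you want a data.frame that includes other variables. (In the code above both are named otherVar2, so R renames the second to otherVar2.1.) I chose a numeric variable and a text variable; because each is given as a single value, it is recycled across all observations in the dataset (3 in the example).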
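read.table() and scan() both treat a line break as the end of a field, so a file whose reviews wrap across several lines gets split at every line break as well; the 1244 rows in the question suggest that may be happening. A minimal sketch of one way around it, assuming hard-wrapped lines: read the lines, glue them back together, and split on "|" only. The temporary file and its contents are made up for illustration.

```r
# Hypothetical stand-in for the real file: reviews start with "|" and wrap across lines
path <- tempfile(fileext = ".txt")
writeLines(c("|First review, wrapped", "over two lines. |Second", "review."), path)

# Glue the physical lines back into one string, then split on the literal "|"
raw <- paste(readLines(path), collapse = " ")
reviews <- strsplit(raw, "|", fixed = TRUE)[[1]]

# Trim whitespace and drop the empty piece produced by the leading "|"
reviews <- trimws(reviews)
reviews <- reviews[nzchar(reviews)]

# Single values such as 1 and "blue" are recycled to one value per review
myDataFrame <- data.frame(text = reviews, otherVar1 = 1, otherVar2 = "blue",
                          stringsAsFactors = FALSE)
str(myDataFrame)
```

Dropping the empty piece afterwards also means the leading "|" does not have to be removed by hand.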

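I realize your question was how to put each text in a separate column, but that is not a good way to use a data.frame, because data.frames are designed to hold variables in columns. (With one text per column, you cannot add other variables.)

If you really do want that, you have to coerce the data after transposing it, like this: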
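To illustrate the point about keeping the texts in rows: with one review per row, further variables can be added as ordinary columns. A small sketch with made-up reviews; the column names `id` and `n_chars` are placeholders, not anything from the question.

```r
# Made-up reviews standing in for the scanned text
reviews <- c("Painless check-in.", "Flight was excellent.", "Crew were poor.")

# One row per review; other variables sit alongside as columns
myDataFrame <- data.frame(text = reviews, stringsAsFactors = FALSE)
myDataFrame$id <- seq_along(reviews)             # running review number
myDataFrame$n_chars <- nchar(myDataFrame$text)   # length of each review in characters

dim(myDataFrame)  # rows are reviews, columns are variables
```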

myDataFrame <- as.data.frame(t(data.frame(text = myData, stringsAsFactors = FALSE)), stringsAsFactors = FALSE)
str(myDataFrame)
## 'data.frame':    1 obs. of  3 variables:
##  $ V1: chr "Painless check-in. Two legs of 3 on AC: AC105, YYZ-YVR. Roomy and clean A321 with fantastic crew. AC33: YVR-SYD, very light loa"| __truncated__
##  $ V2: chr "We recently returned from Dublin to Toronto, then on to Winnipeg. Other than cutting it close due to limited staffing in Toront"| __truncated__
##  $ V3: chr "Flew Toronto to Heathrow. Much worse flight than on the way out. We paid a hefty extra fee for exit seats which had no storage "| __truncated__
length(myDataFrame)
## [1] 3
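
A possibly simpler route to the same wide layout, sketched with made-up reviews: turn the character vector into a named list and pass that to as.data.frame(), which makes each list element its own column. The names V1, V2, ... are chosen here only to mirror the output above.

```r
# Made-up reviews standing in for the scanned text
reviews <- c("Painless check-in.", "Flight was excellent.", "Crew were poor.")

# Name each element V1, V2, ... so every review becomes its own column
wide <- as.data.frame(setNames(as.list(reviews), paste0("V", seq_along(reviews))),
                      stringsAsFactors = FALSE)
str(wide)
length(wide)  # number of columns, one per review
```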

"Just a piece of banana bread"? Definitely economy class.

Delimiter problem when importing data into R
