Question

I have a tab-delimited ("\t") data file that looks like this:

Hotel       Price   Location
hotel1      100       A
hotel2      Unknown   B
hotel3      1,200     C
hotel4      <id=?h    B

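In the Price column, some numbers contain commas, such as "1,200". In some rows the Price column is garbled and contains "Unknown" or other content that has no "\t" in it and follows no particular pattern.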

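How can I read this file, drop every row whose Price is garbled, and remove the commas from all the numbers? What I want to end up with is this: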

Hotel       Price   Location
hotel1      100     A
hotel3      1200    C

I tried using

price <- read.table("hotel.txt", header=TRUE, colClasses=c("Price"="integer"))

It does not work, because scan() expects 'an integer' but gets something that is not an integer.

Can anyone help?

Thanks in advance.

Answer 1

In 2 steps:

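## keep only rows whose Price consists of digits and commas
## (anchored with ^ and $ so that junk containing a stray digit is not kept)
dat <- dat[grepl('^[0-9,]+$',dat$Price),]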
# Hotel Price Location
# 1 hotel1   100        A
# 3 hotel3 1,200        C

## convert price to numeric
dat$Price <- as.numeric(gsub(',','',dat$Price))

#    Hotel Price Location
# 1 hotel1   100        A
# 3 hotel3  1200        C

where dat is:

dat <- read.table(text='Hotel   Price   Location
hotel1  100 A
hotel2  Unknown B
hotel3  1,200   C
hotel4  <id=?h  B',header=TRUE)
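
The snippet above builds dat from inline text. To apply the same two steps to the real tab-separated file, something like the following should work. It is only a sketch: a temporary file stands in for hotel.txt, and reading every column as character (colClasses = "character") is my assumption about the simplest way to stop scan() from failing on the garbled prices.

# Write the sample data to a temporary tab-separated file (stand-in for hotel.txt).
path <- tempfile(fileext = ".txt")
writeLines(c(
  "Hotel\tPrice\tLocation",
  "hotel1\t100\tA",
  "hotel2\tUnknown\tB",
  "hotel3\t1,200\tC",
  "hotel4\t<id=?h\tB"
), path)

# Read every column as text so that malformed prices cannot make scan() fail.
# quote = "" and comment.char = "" keep stray quotes or '#' in junk values
# from being interpreted by read.table.
dat <- read.table(path, header = TRUE, sep = "\t", quote = "",
                  comment.char = "", colClasses = "character")

# Keep rows whose Price is made up of digits and commas only, then strip the commas.
dat <- dat[grepl("^[0-9,]+$", dat$Price), ]
dat$Price <- as.integer(gsub(",", "", dat$Price, fixed = TRUE))

print(dat)

With the sample data this leaves the hotel1 and hotel3 rows, with Price stored as an integer. If the prices can have decimals, the pattern and the as.integer() call would need adjusting.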

How do I read a file with a messy column?
