R: read.table produces unexpected results on a tab-delimited data file

Date: 2016-02-07 21:59:41

Tags: r

I am new to R. I am running RStudio on Windows, and I have spent a while trying to understand what is going on with the following read.table command.

continents=read.table("country2continent.tsv",sep="\t",
  col.names=c("Country","Continent"),fileEncoding = "UTF-8",strip.white = TRUE)

Problems:

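  1. If I print one column at the console with continents$Country, the data come out completely garbled. I inspected the garbled output and found special characters such as "\t" embedded in it. What can I do to get rid of the special characters that are causing the problem?

  2. If I view the continents data frame in RStudio, it almost looks right. I say almost because inspecting the data frame shows that row 61 has a problem. It should hold "Cote d'Ivoire" and "Africa", but it does not. In this row (row 61), Cote d'Ivoire is missing its apostrophe, and there is a tab between Cote d'Ivoire and Africa. A number of the country/continent pairs that follow it also have no rows of their own. Any suggestions on how to fix this?

  3. As requested by rawr, here is a sample of the data, including the problematic row 61: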

    Algeria Africa
    Angola  Africa
    Benin   Africa
    Botswana    Africa
    Burkina Faso    Africa
    Burundi Africa
    Cameroon    Africa
    Cape Verde  Africa
    Central African Republic    Africa
    Chad    Africa
    Comoros Africa
    Congo - Brazzaville Africa
    Congo - Kinshasa    Africa
    Côte d’Ivoire   Africa
    Djibouti    Africa
    Egypt   Africa
    Equatorial Guinea   Africa
    Eritrea Africa
    Ethiopia    Africa
    Gabon   Africa
    Gambia  Africa
    Ghana   Africa
    Guinea  Africa
    Guinea-Bissau   Africa
    Kenya   Africa
    Lesotho Africa
    Liberia Africa
    Libya   Africa
    Madagascar  Africa
    Malawi  Africa
    Mali    Africa
    Mauritania  Africa
    Mauritius   Africa
    Mayotte Africa
    Morocco Africa
    Mozambique  Africa
    Namibia Africa
    Niger   Africa
    Nigeria Africa
    Rwanda  Africa
    Réunion Africa
    Saint Helena    Africa
    Senegal Africa
    Seychelles  Africa
    Sierra Leone    Africa
    Somalia Africa
    South Africa    Africa
    Sudan   Africa
    Swaziland   Africa
    São Tomé and Príncipe   Africa
    Tanzania    Africa
    Togo    Africa
    Tunisia Africa
    Uganda  Africa
    Western Sahara  Africa
    Zambia  Africa
    Zimbabwe    Africa
    Eritrea and Ethiopia    Africa
    South Sudan Africa
    Sao Tome and Principe   Africa
    Cote d'Ivoire   Africa
    Reunion Africa
    Congo, Dem. Rep.    Africa
    Congo, Rep. Africa
    Anguilla    Americas
    Antigua and Barbuda Americas
    Argentina   Americas
    Aruba   Americas
    Bahamas Americas
    Barbados    Americas
    Belize  Americas
    Bermuda Americas
    Bolivia Americas
    Brazil  Americas
    British Virgin Islands  Americas
    Canada  Americas
    Cayman Islands  Americas
    Chile   Americas
    Colombia    Americas
    Costa Rica  Americas
    Cuba    Americas
    Dominica    Americas
    Dominican Republic  Americas
    Ecuador Americas
    El Salvador Americas
    Falkland Islands    Americas
    French Guiana   Americas
    Greenland   Americas
    Grenada Americas
    Guadeloupe  Americas
    Guatemala   Americas
    Guyana  Americas
    Haiti   Americas
    Honduras    Americas
    Jamaica Americas
    Martinique  Americas
    Mexico  Americas
    Montserrat  Americas
    Netherlands Antilles    Americas
    Nicaragua   Americas
    Panama  Americas
    Paraguay    Americas
    Peru    Americas
    Puerto Rico Americas
    St. Barthélemy  Americas
    St. Kitts and Nevis Americas
    St. Lucia   Americas
    St. Martin  Americas
    St. Pierre and Miquelon Americas
    St. Vincent and the Grenadines  Americas
    Suriname    Americas
    Trinidad and Tobago Americas
    Turks and Caicos Islands    Americas
    Virgin Islands (U.S.)   Americas
    United States   Americas
    Uruguay Americas
    Venezuela   Americas
    St.-Pierre-et-Miquelon  Americas
    St. Helena  Americas
    Sint Maarten (Dutch part)   Americas
    Falkland Is (Malvinas)  Americas
    Curaçao Americas
    Pitcairn    Americas
    Cocos Island    Americas
    Afghanistan Asia
    Armenia Asia
    Azerbaijan  Asia
    Bahrain Asia
    Bangladesh  Asia
    Bhutan  Asia
    Brunei  Asia
    Cambodia    Asia
    China   Asia
    Cyprus  Asia
    Georgia Asia
    Hong Kong, China    Asia
    India   Asia
    Indonesia   Asia
    Iran    Asia
    Iraq    Asia
    Israel  Asia
    Japan   Asia
    Jordan  Asia
    Kazakhstan  Asia
    Kuwait  Asia
    Kyrgyzstan  Asia
    Laos    Asia
    Lebanon Asia
    Macao, China    Asia
    Malaysia    Asia
    Maldives    Asia
    Mongolia    Asia
    Myanmar [Burma] Asia
    Nepal   Asia
    Neutral Zone    Asia
    North Korea Asia
    Oman    Asia
    Pakistan    Asia
    West Bank and Gaza  Asia
    People's Democratic Republic of Yemen   Asia
    Philippines Asia
    Qatar   Asia
    Saudi Arabia    Asia
    Singapore   Asia
    South Korea Asia
    Sri Lanka   Asia
    Syria   Asia
    Taiwan  Asia
    Tajikistan  Asia
    Thailand    Asia
    Timor-Leste Asia
    Turkey  Asia
    Turkmenistan    Asia
    United Arab Emirates    Asia
    Uzbekistan  Asia
    Vietnam Asia
    Yemen   Asia
    Myanmar Asia
    Lao Asia
    United Korea (former)   Asia
    South Yemen (former)    Asia
    North Yemen (former)    Asia
    Albania Europe
    Andorra Europe
    Austria Europe
    Belarus Europe
    Belgium Europe
    Bosnia and Herzegovina  Europe
    Bulgaria    Europe
    Croatia Europe
    Cyprus  Europe
    Czech Republic  Europe
    Denmark Europe
    East Germany    Europe
    Estonia Europe
    Faroe Islands   Europe
    Finland Europe
    France  Europe
    Germany Europe
    Gibraltar   Europe
    Greece  Europe
    Guernsey    Europe
    Hungary Europe
    Iceland Europe
    Ireland Europe
    Isle of Man Europe
    Italy   Europe
    Jersey  Europe
    Latvia  Europe
    Liechtenstein   Europe
    Lithuania   Europe
    Luxembourg  Europe
    Macedonia   Europe
    Malta   Europe
    Metropolitan France Europe
    Moldova Europe
    Monaco  Europe
    Montenegro  Europe
    Netherlands Europe
    Norway  Europe
    Poland  Europe
    Portugal    Europe
    Romania Europe
    Russia  Europe
    San Marino  Europe
    Serbia  Europe
    Serbia and Montenegro   Europe
    Slovakia    Europe
    Slovenia    Europe
    Spain   Europe
    Svalbard and Jan Mayen  Europe
    Sweden  Europe
    Switzerland Europe
    Ukraine Europe
    USSR    Europe
    United Kingdom  Europe
    Vatican City    Europe
    Åland Islands   Europe
    Åland   Europe
    West Germany    Europe
    Yugoslavia  Europe
    Serbia excluding Kosova Europe
    Serbia excluding Kosovo Europe
    Slovak Republic Europe
    Svalbard    Europe
    Kosovo  Europe
    Kyrgyz Republic Europe
    Czechoslovakia  Europe
    Macedonia   Europe
    Macedonia, FYR  Europe
    Channel Islands Europe
    Faeroe Islands  Europe
    Holy See    Europe
    Akrotiri and Dhekelia   Europe
    American Samoa  Oceania
    Antarctica  Oceania
    Australia   Oceania
    Bouvet Island   Oceania
    British Indian Ocean Territory  Oceania
    Christmas Island    Oceania
    Cocos [Keeling] Islands Oceania
    Cook Islands    Oceania
    Fiji    Oceania
    French Polynesia    Oceania
    French Southern Territories Oceania
    Guam    Oceania
    Heard Island and McDonald Islands   Oceania
    Kiribati    Oceania
    Marshall Islands    Oceania
    Micronesia  Oceania
    Nauru   Oceania
    New Caledonia   Oceania
    New Zealand Oceania
    Niue    Oceania
    Norfolk Island  Oceania
    Northern Mariana Islands    Oceania
    Palau   Oceania
    Papua New Guinea    Oceania
    Pitcairn Islands    Oceania
    Samoa   Oceania
    Solomon Islands Oceania
    South Georgia and the South Sandwich Islands    Oceania
    Tokelau Oceania
    Tonga   Oceania
    Tuvalu  Oceania
    U.S. Minor Outlying Islands Oceania
    Vanuatu Oceania
    Wallis et Futuna    Oceania
    Micronesia, Fed. Sts.   Oceania
    Cook Is Oceania
    

2 answers:

Answer 0 (score: 2)

I just copied your data into a text file named countries.tsv and ran the following code. There may be a way to do this directly with read.table, but this was easier for me.

## read in each line of data as a character string
rl <- readLines('~/desktop/countries.tsv')

## this will separate the last word (continent) from the rest of the string
## so this assumes that the second column will _only_ be one word

## (.*)        to 1st capture group any character any number of times
## \\s+        followed by one or more white spaces
## ([a-z]+)$   to 2nd capture group, only take letters a-z one or more times
##               up to the end of the line $

## \\1;\\2     take the two capture groups and separate them with semicolon
txt <- gsub('(.*)\\s+([a-z]+)$', '\\1;\\2', rl, ignore.case = TRUE)

txt[c(1:5, 60:62)]
# [1] "Algeria;Africa"                 "Angola ;Africa"                
# [3] "Benin  ;Africa"                 "Botswana   ;Africa"            
# [5] "Burkina Faso   ;Africa"         "Sao Tome and Principe  ;Africa"
# [7] "Cote d'Ivoire  ;Africa"         "Reunion;Africa"   

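Now that we have a vector of semicolon-separated strings, we can pass it straight to read.table through the text= argument. Note that because the data contain some unbalanced quote characters, such as the apostrophe in row 61 that you pointed out, we also set quote = '' to disable quoting: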
dd <- read.table(text = txt, sep = ';', quote = '', stringsAsFactors = FALSE,
                 col.names = c("Country","Continent"), strip.white = TRUE)

str(dd)
# 'data.frame': 290 obs. of  2 variables:
#   $ Country  : chr  "Algeria" "Angola" "Benin" "Botswana" ...
#   $ Continent: chr  "Africa" "Africa" "Africa" "Africa" ...

dd[c(1:5, 60:62), ]
#                  Country Continent
# 1                Algeria    Africa
# 2                 Angola    Africa
# 3                  Benin    Africa
# 4               Botswana    Africa
# 5           Burkina Faso    Africa
# 60 Sao Tome and Principe    Africa
# 61         Cote d'Ivoire    Africa
# 62               Reunion    Africa
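
The same quote = '' setting can probably be applied to the tab-delimited data directly, without the intermediate semicolon step. The following is a minimal sketch rather than a tested fix for the original file: sample_lines is a hypothetical three-row stand-in for the real data, with real tab characters between the fields.

```r
# Hypothetical stand-in for three lines of the file; fields are separated
# by a real tab character ("\t").
sample_lines <- c("Sao Tome and Principe\tAfrica",
                  "Cote d'Ivoire\tAfrica",
                  "Reunion\tAfrica")

# quote = "" makes read.table treat ' and " as ordinary characters, so the
# apostrophe in "Cote d'Ivoire" cannot open a quoted string.
direct <- read.table(text = sample_lines, sep = "\t", quote = "",
                     col.names = c("Country", "Continent"),
                     stringsAsFactors = FALSE, strip.white = TRUE)

print(direct)
```

For the real file, the text = argument would be replaced by the file name, keeping fileEncoding = "UTF-8" as in the question.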

Answer 1 (score: 1)

If you do not do this often, I would suggest downloading the file, editing it into standard .csv format, and working with that file instead.

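You can download the file by putting the URL into your browser. It has two columns separated by a tab. Add a double quote at the start and end of each line, and change each tab to ",". Change the file extension from .tsv to .csv. It is not obvious that the file is UTF-8.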
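
The manual edit described above could also be scripted. This is a minimal sketch under stated assumptions: tsv_lines is a hypothetical two-row stand-in for the file contents, the temporary output path is illustrative, and no field is assumed to contain a double quote (such a field would need its quotes doubled first).

```r
# Hypothetical stand-in for the lines of the .tsv file (real tabs inside).
tsv_lines <- c("Algeria\tAfrica",
               "Cote d'Ivoire\tAfrica")

# Turn each tab into "," and wrap the whole line in double quotes,
# giving lines of the form "Country","Continent".
csv_lines <- paste0('"', gsub("\t", '","', tsv_lines, fixed = TRUE), '"')

# Write to a temporary .csv file (illustrative path) and read it back.
csv_path <- tempfile(fileext = ".csv")
writeLines(csv_lines, csv_path)

# read.csv only treats the double quote as a quote character by default,
# so the apostrophe inside the quoted field is kept as data.
from_csv <- read.csv(csv_path, header = FALSE,
                     col.names = c("Country", "Continent"),
                     stringsAsFactors = FALSE)
print(from_csv)
```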

OK, I copied your file to my hard drive and ran this code in RGui.

This worked for me:

mytable <- read.table("C:/Users/Philip/Downloads/country2continent.tsv",sep="\t",header=FALSE)  

> head(mytable)
            V1     V2
1      Algeria Africa
2       Angola Africa
3        Benin Africa
4     Botswana Africa
5 Burkina Faso Africa
6      Burundi Africa
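
head() only shows the first six rows, so it does not cover row 61. One way to check the whole import is to compare the number of parsed rows with the number of lines in the file and to list the lines that contain quote characters. The sketch below assumes a hypothetical three-line stand-in (raw_lines) for the file contents; for the real file, raw_lines would come from readLines().

```r
# Hypothetical stand-in for the file contents (real tabs between fields).
raw_lines <- c("Algeria\tAfrica",
               "Cote d'Ivoire\tAfrica",
               "People's Democratic Republic of Yemen\tAsia")

# Every well-formed line should split into exactly two fields on the tab.
fields_per_line <- lengths(strsplit(raw_lines, "\t", fixed = TRUE))

# Lines containing ' or ", which read.table treats as quote characters
# under its default quote setting.
has_quote_char <- grepl("['\"]", raw_lines)
print(raw_lines[has_quote_char])

# With quoting disabled, the parsed data frame should have one row per line.
parsed <- read.table(text = raw_lines, sep = "\t", quote = "",
                     header = FALSE, stringsAsFactors = FALSE)
stopifnot(nrow(parsed) == length(raw_lines))
```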