I'm an R newbie. I'm running R Studio on Windows, and I have spent a while trying to understand what is going on with the following read.table command.
continents=read.table("country2continent.tsv",sep="\t",
col.names=c("Country","Continent"),fileEncoding = "UTF-8",strip.white = TRUE)
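Problems:
If I print a single column on the command line with continents$Country, the data comes out completely garbled. I inspected the garbled data and found special characters such as "\t" embedded in it. What can I do to get rid of the special characters that are causing the problem?
If I view the continents data frame in R Studio, it almost looks correct. I say almost because inspecting the data frame shows a problem in row 61. It should be "Cote d'Ivoire Africa", but instead it is "Cote dIvoire\tAfrica". In this row (row 61), the apostrophe is missing from Cote d'Ivoire and there is a tab between Cote dIvoire and Africa. Many of the country/continent pairs that follow "Cote d'Ivoire Africa" also do not get rows of their own. Any suggestions on how to fix this?
As requested by rawr, here is a sample of the data, including the problematic row 61: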
Algeria Africa
Angola Africa
Benin Africa
Botswana Africa
Burkina Faso Africa
Burundi Africa
Cameroon Africa
Cape Verde Africa
Central African Republic Africa
Chad Africa
Comoros Africa
Congo - Brazzaville Africa
Congo - Kinshasa Africa
Côte d’Ivoire Africa
Djibouti Africa
Egypt Africa
Equatorial Guinea Africa
Eritrea Africa
Ethiopia Africa
Gabon Africa
Gambia Africa
Ghana Africa
Guinea Africa
Guinea-Bissau Africa
Kenya Africa
Lesotho Africa
Liberia Africa
Libya Africa
Madagascar Africa
Malawi Africa
Mali Africa
Mauritania Africa
Mauritius Africa
Mayotte Africa
Morocco Africa
Mozambique Africa
Namibia Africa
Niger Africa
Nigeria Africa
Rwanda Africa
Réunion Africa
Saint Helena Africa
Senegal Africa
Seychelles Africa
Sierra Leone Africa
Somalia Africa
South Africa Africa
Sudan Africa
Swaziland Africa
São Tomé and Príncipe Africa
Tanzania Africa
Togo Africa
Tunisia Africa
Uganda Africa
Western Sahara Africa
Zambia Africa
Zimbabwe Africa
Eritrea and Ethiopia Africa
South Sudan Africa
Sao Tome and Principe Africa
Cote d'Ivoire Africa
Reunion Africa
Congo, Dem. Rep. Africa
Congo, Rep. Africa
Anguilla Americas
Antigua and Barbuda Americas
Argentina Americas
Aruba Americas
Bahamas Americas
Barbados Americas
Belize Americas
Bermuda Americas
Bolivia Americas
Brazil Americas
British Virgin Islands Americas
Canada Americas
Cayman Islands Americas
Chile Americas
Colombia Americas
Costa Rica Americas
Cuba Americas
Dominica Americas
Dominican Republic Americas
Ecuador Americas
El Salvador Americas
Falkland Islands Americas
French Guiana Americas
Greenland Americas
Grenada Americas
Guadeloupe Americas
Guatemala Americas
Guyana Americas
Haiti Americas
Honduras Americas
Jamaica Americas
Martinique Americas
Mexico Americas
Montserrat Americas
Netherlands Antilles Americas
Nicaragua Americas
Panama Americas
Paraguay Americas
Peru Americas
Puerto Rico Americas
St. Barthélemy Americas
St. Kitts and Nevis Americas
St. Lucia Americas
St. Martin Americas
St. Pierre and Miquelon Americas
St. Vincent and the Grenadines Americas
Suriname Americas
Trinidad and Tobago Americas
Turks and Caicos Islands Americas
Virgin Islands (U.S.) Americas
United States Americas
Uruguay Americas
Venezuela Americas
St.-Pierre-et-Miquelon Americas
St. Helena Americas
Sint Maarten (Dutch part) Americas
Falkland Is (Malvinas) Americas
Curaçao Americas
Pitcairn Americas
Cocos Island Americas
Afghanistan Asia
Armenia Asia
Azerbaijan Asia
Bahrain Asia
Bangladesh Asia
Bhutan Asia
Brunei Asia
Cambodia Asia
China Asia
Cyprus Asia
Georgia Asia
Hong Kong, China Asia
India Asia
Indonesia Asia
Iran Asia
Iraq Asia
Israel Asia
Japan Asia
Jordan Asia
Kazakhstan Asia
Kuwait Asia
Kyrgyzstan Asia
Laos Asia
Lebanon Asia
Macao, China Asia
Malaysia Asia
Maldives Asia
Mongolia Asia
Myanmar [Burma] Asia
Nepal Asia
Neutral Zone Asia
North Korea Asia
Oman Asia
Pakistan Asia
West Bank and Gaza Asia
People's Democratic Republic of Yemen Asia
Philippines Asia
Qatar Asia
Saudi Arabia Asia
Singapore Asia
South Korea Asia
Sri Lanka Asia
Syria Asia
Taiwan Asia
Tajikistan Asia
Thailand Asia
Timor-Leste Asia
Turkey Asia
Turkmenistan Asia
United Arab Emirates Asia
Uzbekistan Asia
Vietnam Asia
Yemen Asia
Myanmar Asia
Lao Asia
United Korea (former) Asia
South Yemen (former) Asia
North Yemen (former) Asia
Albania Europe
Andorra Europe
Austria Europe
Belarus Europe
Belgium Europe
Bosnia and Herzegovina Europe
Bulgaria Europe
Croatia Europe
Cyprus Europe
Czech Republic Europe
Denmark Europe
East Germany Europe
Estonia Europe
Faroe Islands Europe
Finland Europe
France Europe
Germany Europe
Gibraltar Europe
Greece Europe
Guernsey Europe
Hungary Europe
Iceland Europe
Ireland Europe
Isle of Man Europe
Italy Europe
Jersey Europe
Latvia Europe
Liechtenstein Europe
Lithuania Europe
Luxembourg Europe
Macedonia Europe
Malta Europe
Metropolitan France Europe
Moldova Europe
Monaco Europe
Montenegro Europe
Netherlands Europe
Norway Europe
Poland Europe
Portugal Europe
Romania Europe
Russia Europe
San Marino Europe
Serbia Europe
Serbia and Montenegro Europe
Slovakia Europe
Slovenia Europe
Spain Europe
Svalbard and Jan Mayen Europe
Sweden Europe
Switzerland Europe
Ukraine Europe
USSR Europe
United Kingdom Europe
Vatican City Europe
Åland Islands Europe
Åland Europe
West Germany Europe
Yugoslavia Europe
Serbia excluding Kosova Europe
Serbia excluding Kosovo Europe
Slovak Republic Europe
Svalbard Europe
Kosovo Europe
Kyrgyz Republic Europe
Czechoslovakia Europe
Macedonia Europe
Macedonia, FYR Europe
Channel Islands Europe
Faeroe Islands Europe
Holy See Europe
Akrotiri and Dhekelia Europe
American Samoa Oceania
Antarctica Oceania
Australia Oceania
Bouvet Island Oceania
British Indian Ocean Territory Oceania
Christmas Island Oceania
Cocos [Keeling] Islands Oceania
Cook Islands Oceania
Fiji Oceania
French Polynesia Oceania
French Southern Territories Oceania
Guam Oceania
Heard Island and McDonald Islands Oceania
Kiribati Oceania
Marshall Islands Oceania
Micronesia Oceania
Nauru Oceania
New Caledonia Oceania
New Zealand Oceania
Niue Oceania
Norfolk Island Oceania
Northern Mariana Islands Oceania
Palau Oceania
Papua New Guinea Oceania
Pitcairn Islands Oceania
Samoa Oceania
Solomon Islands Oceania
South Georgia and the South Sandwich Islands Oceania
Tokelau Oceania
Tonga Oceania
Tuvalu Oceania
U.S. Minor Outlying Islands Oceania
Vanuatu Oceania
Wallis et Futuna Oceania
Micronesia, Fed. Sts. Oceania
Cook Is Oceania
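Answer 0: (score: 2)
I just copied your data into a text file named countries.tsv and ran the following code. There may be a way to do this directly with read.table, but this was easier for me.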
## read in each line of data as a character string
rl <- readLines('~/desktop/countries.tsv')
## this will separate the last word (continent) from the rest of the string
## so this assumes that the second column will _only_ be one word
## (.*) to 1st capture group any character any number of times
## \\s+ followed by one or more white spaces
## ([a-z]+)$ to 2nd capture group, only take letters a-z one or more times
## up to the end of the line $
## \\1;\\2 take the two capture groups and separate them with semicolon
txt <- gsub('(.*)\\s+([a-z]+)$', '\\1;\\2', rl, ignore.case = TRUE)
txt[c(1:5, 60:62)]
# [1] "Algeria;Africa" "Angola ;Africa"
# [3] "Benin ;Africa" "Botswana ;Africa"
# [5] "Burkina Faso ;Africa" "Sao Tome and Principe ;Africa"
# [7] "Cote d'Ivoire ;Africa" "Reunion;Africa"
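As the output above shows, the greedy (.*) group can leave trailing whitespace in front of the semicolon, which strip.white later removes. A minimal sketch of a variant that avoids this with a lazy first group is below. The sample strings are hypothetical stand-ins for the file contents, and perl = TRUE is an assumption made so the lazy quantifier is handled by the PCRE engine.

```r
# Hypothetical sample lines standing in for readLines() output
rl <- c("Algeria\tAfrica", "Burkina Faso\tAfrica", "Cote d'Ivoire\tAfrica")

# (.*?) is lazy, so the whitespace before the last word is matched by \\s+
# instead of being swallowed into the first capture group
txt <- sub("^(.*?)\\s+([A-Za-z]+)$", "\\1;\\2", rl, perl = TRUE)
print(txt)
```

Like the original pattern, this still assumes the continent is a single word made only of letters.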
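Now that we have a vector of semicolon-separated strings, we can pass it straight to read.table through its text= argument. Note that because you have some irregular quotes, such as the one in row 61 that you pointed out, we also use quote = ''.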
dd <- read.table(text = txt, sep = ';', quote = '', stringsAsFactors = FALSE,
col.names = c("Country","Continent"), strip.white = TRUE)
str(dd)
# 'data.frame': 290 obs. of 2 variables:
# $ Country : chr "Algeria" "Angola" "Benin" "Botswana" ...
# $ Continent: chr "Africa" "Africa" "Africa" "Africa" ...
dd[c(1:5, 60:62), ]
# Country Continent
# 1 Algeria Africa
# 2 Angola Africa
# 3 Benin Africa
# 4 Botswana Africa
# 5 Burkina Faso Africa
# 60 Sao Tome and Principe Africa
# 61 Cote d'Ivoire Africa
# 62 Reunion Africa
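On the remark that there may be a way to do this directly with read.table: a likely cause of the merged rows is that read.table treats both " and ' as quote characters by default, so the apostrophe in row 61 opens a quoted string that runs on until the next apostrophe in the file. The sketch below, which assumes that diagnosis, reads a tab-separated file in one step with quoting disabled. The temporary file and its four lines are hypothetical sample data, not the real file.

```r
# Hypothetical sample written to a temporary tab-separated file
tmp <- tempfile(fileext = ".tsv")
writeLines(c("Algeria\tAfrica",
             "Cote d'Ivoire\tAfrica",
             "Reunion\tAfrica",
             "Congo, Dem. Rep.\tAfrica"), tmp)

# quote = "" disables quote handling, so the apostrophe is read literally;
# comment.char = "" likewise keeps any '#' in the data from starting a comment
dd <- read.table(tmp, sep = "\t", quote = "", comment.char = "",
                 col.names = c("Country", "Continent"),
                 stringsAsFactors = FALSE, strip.white = TRUE)
print(dd)
```

For the real file, the fileEncoding = "UTF-8" argument from the question can presumably be kept alongside quote = "".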
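Answer 1: (score: 1)
If you don't do this often, I suggest downloading the file, editing it into standard .csv format, and working with that file instead.
You can download the file by pasting the URL into your browser. It has two columns separated by a tab. Add a double quote at the start and end of each line, and change the tab to ",". Change the file type from .tsv to .csv. It is not obvious that the file is UTF-8.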
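If editing by hand is inconvenient, the same conversion could be scripted. The sketch below is only one possible way to do it and is not part of the manual procedure described above: it reads a tab-separated file with quoting disabled and writes it back out with write.csv, which wraps character fields in double quotes by default. The file names and sample rows are hypothetical.

```r
# Hypothetical sample written to a temporary .tsv file
tsv <- tempfile(fileext = ".tsv")
writeLines(c("Algeria\tAfrica",
             "Congo, Dem. Rep.\tAfrica",
             "Cote d'Ivoire\tAfrica"), tsv)

# Read the raw fields without interpreting quotes or comment characters
raw <- read.table(tsv, sep = "\t", quote = "", comment.char = "",
                  col.names = c("Country", "Continent"),
                  stringsAsFactors = FALSE)

# Write a standard .csv; character fields are double-quoted by default,
# so embedded commas and apostrophes survive the round trip
csv <- tempfile(fileext = ".csv")
write.csv(raw, csv, row.names = FALSE)

back <- read.csv(csv, stringsAsFactors = FALSE)
print(back)
```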
OK, I copied your file to my hard drive and used this code in RGUI.
This worked for me:
mytable <- read.table("C:/Users/Philip/Downloads/country2continent.tsv",sep="\t",header=FALSE)
> head(mytable)
V1 V2
1 Algeria Africa
2 Angola Africa
3 Benin Africa
4 Botswana Africa
5 Burkina Faso Africa
6 Burundi Africa
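Since head() shows only the first six rows, it would not reveal a problem further down, such as the one reported at row 61. A minimal sketch of a sanity check is below: it compares the number of lines in the file with the number of rows read and confirms that every line has two fields. The sample file is hypothetical, and the read here uses quote = "", which differs from the call above.

```r
# Hypothetical sample written to a temporary .tsv file
tmp <- tempfile(fileext = ".tsv")
writeLines(c("Algeria\tAfrica",
             "Cote d'Ivoire\tAfrica",
             "Reunion\tAfrica"), tmp)

mytable <- read.table(tmp, sep = "\t", quote = "", comment.char = "",
                      header = FALSE, stringsAsFactors = FALSE)

# Every line of the file should have become exactly one row
n_lines <- length(readLines(tmp))
rows_ok <- nrow(mytable) == n_lines

# Every line should contain exactly two tab-separated fields
fields <- count.fields(tmp, sep = "\t", quote = "", comment.char = "")
fields_ok <- all(fields == 2)

print(c(rows_ok = rows_ok, fields_ok = fields_ok))
```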