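In R, I ran some code to get a two-column data frame containing locations (cities, states, countries) and their corresponding counts. I ran summary() on the column and converted the result to a data frame.
I am struggling to merge all the states into a single country. For example, in the output below, I would like to combine all US states and cities into one country, "United States". Can I use grep() to find the patterns and then use some package to combine them? Please advise how to do this.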
location<-summary(pind$userLocation)
location<-as.data.frame(location)
location
Data:
location
271286
null 58145
Texas 1027
United States 900
USA 866
Paris 755
California 590
Canada 535
Florida 438
New York 392
Australia 379
London 375
Ohio 373
Michigan 354
Chicago, IL 335
Los Angeles, CA 323
Chicago 299
Colorado 275
New York, NY 275
North Carolina 271
Minnesota 259
Seattle, WA 254
Los Angeles 249
Indiana 247
Virginia 244
Wisconsin 231
Arizona 224
Atlanta, GA 221
Dallas, TX 220
Oregon 218
Georgia 204
Houston, TX 200
Oklahoma 200
Utah 198
Austin, TX 190
Pennsylvania 189
Illinois 187
San Diego, CA 184
Tennessee 182
UK 182
Missouri 181
Kentucky 173
San Francisco, CA 172
Louisiana 167
NYC 167
Alabama 163
Nashville, TN 157
Iowa 149
Boston, MA 148
Kansas 145
Southern California 144
Denver, CO 142
New Jersey 140
Sydney, Australia 138
South Carolina 134
Washington, DC 133
Maryland 128
Arkansas 127
Portland, OR 126
Phoenix, AZ 125
Atlanta 124
London, UK 124
Melbourne, Australia 123
Ontario, Canada 121
Seattle 121
Washington 121
Las Vegas, NV 116
New Zealand 116
United Kingdom 116
Brooklyn, NY 115
CA 110
Minneapolis, MN 109
Houston, Texas 105
NC 104
New York City 103
Toronto 103
Austin, Texas 101
Charlotte, NC 101
South Africa 100
Pittsburgh, PA 98
San Francisco 98
Vancouver, BC 95
Germany 94
Phoenix, Arizona 92
Barcelona 89
Dallas, Texas 89
Portland, Oregon 89
England 88
Idaho 86
. 83
San Diego 83
West Virginia 83
Nevada 82
The Netherlands 81
France 79
Raleigh, NC 78
Kansas City, MO 76
Massachusetts 75
US 75
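Answer 0 (score: 2)
Since your data set is not large, this can be done by hand quite easily. I went through each record, determined which country it belongs to, and added a new column holding the result. Once you have the countries, you can use aggregate() to get the sums: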
location <- data.frame(location=c(271286,58145,1027,900,866,755,590,535,438,392,379,375,373,354,335,323,299,275,275,271,259,254,249,247,244,231,224,221,220,218,204,200,200,198,190,189,187,184,182,182,181,173,172,167,167,163,157,149,148,145,144,142,140,138,134,133,128,127,126,125,124,124,123,121,121,121,116,116,116,115,110,109,105,104,103,103,101,101,100,98,98,95,94,92,89,89,89,88,86,83,83,83,82,81,79,78,76,75,75),row.names=c('','null','Texas','United States','USA','Paris','California','Canada','Florida','New York','Australia','London','Ohio','Michigan','Chicago, IL','Los Angeles, CA','Chicago','Colorado','New York, NY','North Carolina','Minnesota','Seattle, WA','Los Angeles','Indiana','Virginia','Wisconsin','Arizona','Atlanta, GA','Dallas, TX','Oregon','Georgia','Houston, TX','Oklahoma','Utah','Austin, TX','Pennsylvania','Illinois','San Diego, CA','Tennessee','UK','Missouri','Kentucky','San Francisco, CA','Louisiana','NYC','Alabama','Nashville, TN','Iowa','Boston, MA','Kansas','Southern California','Denver, CO','New Jersey','Sydney, Australia','South Carolina','Washington, DC','Maryland','Arkansas','Portland, OR','Phoenix, AZ','Atlanta','London, UK','Melbourne, Australia','Ontario, Canada','Seattle','Washington','Las Vegas, NV','New Zealand','United Kingdom','Brooklyn, NY','CA','Minneapolis, MN','Houston, Texas','NC','New York City','Toronto','Austin, Texas','Charlotte, NC','South Africa','Pittsburgh, PA','San Francisco','Vancouver, BC','Germany','Phoenix, Arizona','Barcelona','Dallas, Texas','Portland, Oregon','England','Idaho','.','San Diego','West Virginia','Nevada','The Netherlands','France','Raleigh, NC','Kansas City, MO','Massachusetts','US'));
location$country <- factor(c(NA,NA,'United States','United States','United States','France','United States','Canada','United States','United States','Australia','United Kingdom','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United Kingdom','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','United States','Australia','United States','United States','United States','United States','United States','United States','United States','United Kingdom','Australia','Canada','United States','United States','United States','New Zealand','United Kingdom','United States','Canada','United States','United States','United States','United States','Canada','United States','United States','South Africa','United States','United States','Canada','Germany','United States','Spain','United States','United States','United Kingdom','United States',NA,'United States','United States','United States','Netherlands','France','United States','United States','United States','United States'));
aggregate(location~country,location,sum);
## country location
## 1 Australia 640
## 2 Canada 964
## 3 France 834
## 4 Germany 94
## 5 Netherlands 81
## 6 New Zealand 116
## 7 South Africa 100
## 8 Spain 89
## 9 United Kingdom 885
## 10 United States 15964
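I used NA wherever the country could not be determined from the location name alone; this applies to the three records named '', 'null' and '.'. Because the formula interface of aggregate() drops records whose grouping value is NA, those records are not included in the result.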
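For a larger data set, part of the manual labelling could be automated along the lines the question suggests with grep(). The following is only a minimal sketch, assuming that matching against R's built-in state.name and state.abb vectors is good enough; the helper guess_country(), the alias list, and the toy data frame are hypothetical and not part of the original answer. It only recognises US locations, and it will misclassify ambiguous names (for example the country Georgia), so the result still needs a manual check.

```r
# Hypothetical helper: label a location as "United States" when it is a common
# alias, contains a full US state name, or ends in ", <state abbreviation>".
# state.name and state.abb are built-in R vectors (50 states; DC is added here).
us_aliases <- c("United States", "USA", "US")
abbs <- c(state.abb, "DC")

guess_country <- function(loc) {
  # Full state name anywhere in the string, e.g. "Southern California"
  name_pat <- paste0("\\b(", paste(state.name, collapse = "|"), ")\\b")
  # Abbreviation after a comma at the end, e.g. "Chicago, IL"
  abb_pat <- paste0(",\\s*(", paste(abbs, collapse = "|"), ")$")
  is_us <- loc %in% us_aliases | grepl(name_pat, loc) | grepl(abb_pat, loc)
  ifelse(is_us, "United States", NA_character_)
}

# Toy subset of the data above (names and counts copied from the question)
toy <- data.frame(
  name = c("Texas", "Chicago, IL", "USA", "Paris", "Southern California", "null"),
  n    = c(1027, 335, 866, 755, 144, 58145),
  stringsAsFactors = FALSE
)
toy$country <- guess_country(toy$name)

# Rows whose country is NA ("Paris" and "null" here) are dropped by aggregate()
us_total <- aggregate(n ~ country, toy, sum)
us_total
```

Bare abbreviations such as 'CA' or 'NC' are deliberately left unmatched in this sketch, because they are ambiguous without context; they would have to be labelled by hand as in the answer above.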
Answer 1 (score: 0)
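I am not sure I understand the question, but I will give it a try.
You want to identify, for each location string, the country it belongs to, and then group the rows by country and perform some operation per group?
If that is the case, what comes to mind is the geocode() function in ggmap, which uses the Google Maps API. This only makes sense if you are not making too many queries.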
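library(dplyr)
library(ggmap)
# Note: recent ggmap versions require a Google API key, registered with register_google()
# Look up the country for one location string
MyGeoCode <- function(Location){
  geocode(Location, output = "more")$country
}
# The location names are the row names of the data frame; the counts are in the 'location' column
location$country <- sapply(rownames(location), MyGeoCode)
location <- location %>% group_by(country) %>% summarise(TotalPerCountry = sum(location, na.rm = TRUE))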
This example assumes you want to sum the numeric column for each country; other operations follow the same pattern.
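To illustrate "other operations", here is a small base R sketch that computes several summaries per country in one pass. It assumes the country column has already been filled in by some means; the toy data frame and the column name n are hypothetical.

```r
# Toy data with a country column already assigned (values are illustrative)
toy <- data.frame(
  country = c("United States", "United States", "Canada", "Canada", NA),
  n       = c(1027, 900, 535, 121, 83)
)

# Compute total, mean and number of entries per country in a single call;
# the row with an NA country is dropped by the formula interface
stats <- aggregate(n ~ country, toy,
                   function(x) c(total = sum(x), mean = mean(x), entries = length(x)))

# aggregate() returns the three summaries as one matrix column;
# flatten it into ordinary columns n.total, n.mean and n.entries
stats <- do.call(data.frame, stats)
stats
```

The same grouping could be written with dplyr's group_by() and summarise() as in the answer above; base R is used here only so the sketch runs without extra packages.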