Extract cities from a character vector in R

Date: 2018-01-12 10:33:26

Tags: r geocoding

My dataset db has a column, db$affiliation, that looks like this:

**db$affiliation**
[1] "[SCOTT, ALLEN J.] UNIV CALIF LOS ANGELES, DEPT GEOG, LOS ANGELES, CA 90095 USA"                               
[2] "[VAN DUINEN, RIANNE; VAN DER VEEN, ANNE] UNIV TWENTE, DEPT WATER ENGN & MANAGEMENT, DRIENERLOLAAN 5,POB 217, NL-7500 AE ENSCHEDE, NETHERLANDS."                                                
[3] "[ANANTSUKSOMSRI, SUTEE] CHULALONGKORN UNIV, FAC ARCHITECTURE, BANGKOK, THAILAND."   
[4] ...

I would like to create a column in the same dataset that contains only the city name found in db$affiliation, for example:

 **db$cities**
 [1] LOS ANGELES
 [2] TWENTE
 [3] BANGKOK
 [4] ...

If more than one city name is available, I want the command to return only the last one; if no city name is available, I want NA. How can I do this?

I think I could use world.cities$name from data(world.cities) in the maps package, but I cannot figure out how to match it against db$affiliation.

I even tried splitting the db$affiliation column, like this:

library(splitstackshape)  # provides cSplit()
db$affiliation <- gsub("\\[[^\\]]*\\]", "", db$affiliation, perl=TRUE) # remove content within brackets
db$affiliation[2] # check the separator
db <- cSplit(db, 'affiliation', sep=c(", "), type.convert=FALSE) # split after comma

The result (truncated after affiliation_3):

    affiliation_1            affiliation_2                  affiliation_3 
[1] UNIV CALIF LOS ANGELES   DEPT GEOG                      LOS ANGELES  
[2] UNIV TWENTE              DEPT WATER ENGN & MANAGEMENT   DRIENERLOLAAN            
[3] CHULALONGKORN UNIV       FAC ARCHITECTURE               BANGKOK 

and then tried to match with:

db$cities <- lapply(db$affiliation_1, function(x)x[which(x %in% world.cities$name)])

but I get an empty column.

Thanks for your help!

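3 Answers:

Answer 0: (score: 2)

Your sample strings contain many city names, so you may need to rethink the approach if you still want to fetch only the last city when multiple cities are found in the affiliation column.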

library(maps)
data(world.cities)

#sample data
df <- data.frame(affiliation = c("[SCOTT, ALLEN J.] UNIV CALIF LOS ANGELES, DEPT GEOG, LOS ANGELES, CA 90095 USA",
                                 "[VAN DUINEN, RIANNE; VAN DER VEEN, ANNE] UNIV TWENTE, DEPT WATER ENGN & MANAGEMENT, DRIENERLOLAAN 5,POB 217, NL-7500 AE ENSCHEDE, NETHERLANDS.",
                                 "[ANANTSUKSOMSRI, SUTEE] CHULALONGKORN UNIV, FAC ARCHITECTURE, BANGKOK, THAILAND.",
                                 "Prem"), stringsAsFactors = F)

#fetch each city and its respective country from the 'affiliation' column
cities_country <- lapply(gsub("\\[|\\]|[,;]|\\.","",df$affiliation), function(x) 
  paste(as.character(world.cities$name[sapply(world.cities$name, grepl, x, ignore.case=T)]),
        as.character(world.cities$country.etc[sapply(world.cities$name, grepl, x, ignore.case=T)]),
        sep="_"))
df$cities_country <- lapply(cities_country, function(x) if(identical(x, character(0))) NA_character_ else x)
df

Output:

affiliation
1                                                                 [SCOTT, ALLEN J.] UNIV CALIF LOS ANGELES, DEPT GEOG, LOS ANGELES, CA 90095 USA
2 [VAN DUINEN, RIANNE; VAN DER VEEN, ANNE] UNIV TWENTE, DEPT WATER ENGN & MANAGEMENT, DRIENERLOLAAN 5,POB 217, NL-7500 AE ENSCHEDE, NETHERLANDS.
3                                                               [ANANTSUKSOMSRI, SUTEE] CHULALONGKORN UNIV, FAC ARCHITECTURE, BANGKOK, THAILAND.
4                                                                                                                                           Prem
                                                                                                                                                                                                                                                                                            cities_country
1                                                                      Al_Norway, Alle_Switzerland, Allen_Philippines, Allen_USA, Angeles_Costa Rica, Angeles_Philippines, Cali_Colombia, Cot_Costa Rica, Li_Norway, Los Angeles_Chile, Los Angeles_USA, Os_Kyrgyzstan, Os_Norway, U_Micronesia, Usa_Japan
2 Ae_Marshall Islands, Ede_Netherlands, Ede_Nigeria, Enschede_Netherlands, Hede_China, Ine_Marshall Islands, Laa_Austria, Lola_Guinea, Man_Ivory Coast, Mana_French Guiana, Manage_Belgium, Nagem_Luxembourg, Ob_Russia, Ola_Panama, Po_Burkina Faso, U_Micronesia, Van_Turkey, Wa_Ghana, We_New Caledonia
3                                                                                                                                     Aila_Estonia, Al_Norway, Anan_Japan, Ba_Fiji, Bangkok_Thailand, Hit_Iraq, Ila_Nigeria, Ilan_Taiwan, Long_Thailand, Nan_Thailand, Tsu_Japan, U_Micronesia, Ula_Turkey
4                                                                                                                                                                                                                                                                                                       NA

(Note that in the output above I have kept every city that was matched and, for convenience, appended its respective country.)
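Many of the hits in that output ("Al", "U", "Cali") are substrings of longer words, because grepl() matches anywhere inside the string. A minimal sketch of one way to restrict matching to whole words, using base R only, follows. The helper name whole_word_cities() and the short city_names vector are assumptions made for illustration; in practice you would pass world.cities$name and apply the helper to each row with lapply().

```r
# Hypothetical helper: return the entries of `city_names` that occur in `text`
# as whole words. `text` is expected to be a single string.
whole_word_cities <- function(text, city_names) {
  # Upper-case, turn punctuation into spaces, squeeze whitespace, and pad with
  # one space on each side so that every word is surrounded by spaces.
  normalize <- function(s) {
    s <- gsub("[[:punct:]]+", " ", toupper(s))
    paste0(" ", gsub("\\s+", " ", trimws(s)), " ")
  }
  cleaned <- normalize(text)
  padded <- normalize(city_names)
  # fixed = TRUE treats each padded name as a literal string, not a regex
  found <- vapply(padded, grepl, logical(1), x = cleaned, fixed = TRUE)
  city_names[found]
}

# Small stand-in for world.cities$name (assumption, for illustration only)
city_names <- c("Los Angeles", "Cali", "Al", "Bangkok")
text <- "UNIV CALIF LOS ANGELES, DEPT GEOG, LOS ANGELES, CA 90095 USA"
matched <- whole_word_cities(text, city_names)
```

This does not remove every false positive: a city name that is also an ordinary word or a surname (for example "Allen" or "Van") still matches when it appears as a whole word.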

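Answer 1: (score: 1)

From the few lines you showed, it looks like you can do the following (note that you did not make the letter case consistent: the names in world.cities are mixed case, while your data is upper case):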

tmpVec <- sapply(strsplit(db$affiliation, split = ","), function(x) {
  cleanVec <- toupper(trimws(x))
  cleanVec[max(which(cleanVec %in% toupper(maps::world.cities$name)))]
})

Or add a bit more code inside the function to avoid the ugly warnings.

Answer 2: (score: 1)
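A minimal sketch of what that extra code could look like, assuming the same comma-splitting idea: check whether anything matched before calling max(), and return NA otherwise. The helper name last_city() and the short city_names vector are assumptions for illustration; with the maps package loaded you would pass maps::world.cities$name instead.

```r
# Hypothetical helper: for each affiliation string, return the last
# comma-separated field that exactly matches a known city name
# (case-insensitive), or NA when no field matches.
last_city <- function(affiliation, city_names) {
  known <- toupper(city_names)
  vapply(strsplit(affiliation, split = ",", fixed = TRUE), function(parts) {
    clean <- toupper(trimws(parts))
    hits <- which(clean %in% known)
    # Guard against the empty case so max() never sees a zero-length vector
    if (length(hits) == 0L) NA_character_ else clean[max(hits)]
  }, character(1), USE.NAMES = FALSE)
}

# Small stand-in for maps::world.cities$name (assumption, for illustration only)
city_names <- c("Los Angeles", "Bangkok", "Enschede")

affiliation <- c(
  "[SCOTT, ALLEN J.] UNIV CALIF LOS ANGELES, DEPT GEOG, LOS ANGELES, CA 90095 USA",
  "[ANANTSUKSOMSRI, SUTEE] CHULALONGKORN UNIV, FAC ARCHITECTURE, BANGKOK, THAILAND.",
  "Prem"
)
result <- last_city(affiliation, city_names)
```

Because only whole fields are compared, a field such as "NL-7500 AE ENSCHEDE" would not match "Enschede"; such rows would need extra cleaning before the comparison.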

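Let me leave a partial solution. As far as my own research shows, the text in square brackets appears to be people's names. For example, I found that Sutee Anantsuksomsri is a real name. This observation suggests that we may want to remove the text in brackets.

After removing the text in square brackets, I split the strings into words using unnest_tokens() from the tidytext package. Note that this function converts all letters to lower case; if you do not want that, you can change it by specifying to_lower = FALSE. First, I split each city name into words and assigned an ID number to each city. Second, I cleaned your data: as mentioned above, I removed the text in square brackets with gsub(), and then applied unnest_tokens() to the data. Inside filter(), I kept only the words that also appear in cities. The result is shown below. Obviously, you still have a lot of work to do. I leave the sample data, mydf, below. I hope you can move on from here.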

library(maps)
library(dplyr)
library(tidytext)

data(world.cities)

cities <- world.cities %>%
          mutate(id = 1:n()) %>%
          unnest_tokens(input = name, output = word, token = "words")

temp <- mydf %>%
        mutate(affiliation = gsub(x = affiliation, pattern = "\\[.*\\]", replacement = "")) %>%          
        unnest_tokens(input = affiliation, output = word, token = "words") %>%
        filter(word %in% cities$word)


   id     word
1   1      los
2   1  angeles
3   1      los
4   1  angeles
5   1       ca
6   1      usa
7   2    water
8   2       ae
9   2 enschede
10  3  bangkok
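One detail of the bracket-removal step may be worth illustrating. The pattern "\\[.*\\]" is greedy, so if a string ever contained two bracketed groups, everything between the first "[" and the last "]" would be removed. The sample data has only one group per row, so this is a hedged sketch of an alternative rather than a correction; the input string below is made up for illustration.

```r
# Made-up input with two bracketed groups (assumption, for illustration only)
x <- "[A, B] UNIV ONE, CITY ONE; [C, D] UNIV TWO, CITY TWO"

# Greedy: removes from the first "[" to the last "]"
greedy <- gsub("\\[.*\\]", "", x)

# Non-greedy alternative: "[^]]*" cannot cross a closing bracket,
# so each bracketed group is removed separately
per_group <- gsub("\\[[^]]*\\]", "", x)
```

Here greedy keeps only the text after the second group, while per_group keeps both affiliations.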

Data

mydf <- structure(list(id = 1:3, affiliation = c("[SCOTT, ALLEN J.] UNIV CALIF LOS ANGELES, DEPT GEOG, LOS ANGELES, CA 90095 USA", 
"[VAN DUINEN, RIANNE; VAN DER VEEN, ANNE] UNIV TWENTE, DEPT WATER ENGN & MANAGEMENT, DRIENERLOLAAN 5,POB 217, NL-7500 AE ENSCHEDE, NETHERLANDS.", 
"[ANANTSUKSOMSRI, SUTEE] CHULALONGKORN UNIV, FAC ARCHITECTURE, BANGKOK, THAILAND."
)), .Names = c("id", "affiliation"), row.names = c(NA, -3L), class = "data.frame")