我的数据集db中有一个列,比如db $ affiliation,它看起来像:
**db$affiliation**
[1] "[SCOTT, ALLEN J.] UNIV CALIF LOS ANGELES, DEPT GEOG, LOS ANGELES, CA 90095 USA"
[2] "[VAN DUINEN, RIANNE; VAN DER VEEN, ANNE] UNIV TWENTE, DEPT WATER ENGN & MANAGEMENT, DRIENERLOLAAN 5,POB 217, NL-7500 AE ENSCHEDE, NETHERLANDS."
[3] "[ANANTSUKSOMSRI, SUTEE] CHULALONGKORN UNIV, FAC ARCHITECTURE, BANGKOK, THAILAND."
[4] ...
我想在同一数据集中创建一个列,其中只包含db $ affiliation中的城市名称,例如
**db$cities**
[1] LOS ANGELES
[2] TWENTE
[3] BANGKOK
[4] ...
如果有多个城市名称可用,我希望命令只返回最后一个城市名称,如果没有可用的城市名称,我希望有NA。我怎么能这样做?
我认为我可以在world.cities$name
包中的data(world.cities)
中使用maps
,但我无法弄清楚如何使用db$affiliation <- gsub("\\[[^\\]]*\\]", "", db$affiliation, perl=TRUE) # remove content within brackets
db$affiliation[2] # check the separator
db <- cSplit(db, 'affiliation', sep=c(", "), type.convert=FALSE) # split after comma
。
我甚至尝试拆分db $ affiliation列,例如:
affiliation_1 affiliation_2 affiliation_3
[1] UNIV CALIF LOS ANGELES DEPT GEOG LOS ANGELES
[2] UNIV TWENTE DEPT WATER ENGN & MANAGEMENT DRIENERLOLAAN
[3] CHULALONGKORN UNIV FAC ARCHITECTURE BANGKOK
结果(我在affiliation_3之后将其截断):
db$cities <- lapply(db$affiliation_1, function(x)x[which(x %in% world.cities$name)])
然后通过:
package producer.serialized.avro;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;
public class Sender4 {
public static void main(String[] args) {
String flightSchema = "{\"type\":\"record\"," + "\"name\":\"Flight\","
+ "\"fields\":[{\"name\":\"flight_id\",\"type\":\"string\"},{\"name\":\"flight_to\",\"type\":\"string\"},{\"name\":\"flight_from\",\"type\":\"string\"}]}";
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.0.1:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,io.confluent.kafka.serializers.KafkaAvroSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,io.confluent.kafka.serializers.KafkaAvroSerializer.class);
props.put("schema.registry.url", "http://192.168.0.1:8081");
KafkaProducer producer = new KafkaProducer(props);
Schema.Parser parser = new Schema.Parser();
Schema schema = parser.parse(flightSchema);
GenericRecord avroRecord = new GenericData.Record(schema);
avroRecord.put("flight_id", "myflight");
avroRecord.put("flight_to", "QWE");
avroRecord.put("flight_from", "RTY");
ProducerRecord<String, GenericRecord> record = new ProducerRecord<>("topic9",avroRecord);
producer.send(record);
}
}
但我得到一个空栏。
感谢您的帮助!
答案 0 :(得分:2)
您的示例字符串中有许多城市,因此如果您仍想要获取最后一个城市,则可能需要重新考虑。如果在affiliation
列中找到多个城市。
library(maps)
data(world.cities)
#sample data
df <- data.frame(affiliation = c("[SCOTT, ALLEN J.] UNIV CALIF LOS ANGELES, DEPT GEOG, LOS ANGELES, CA 90095 USA",
"[VAN DUINEN, RIANNE; VAN DER VEEN, ANNE] UNIV TWENTE, DEPT WATER ENGN & MANAGEMENT, DRIENERLOLAAN 5,POB 217, NL-7500 AE ENSCHEDE, NETHERLANDS.",
"[ANANTSUKSOMSRI, SUTEE] CHULALONGKORN UNIV, FAC ARCHITECTURE, BANGKOK, THAILAND.",
"Prem"), stringsAsFactors = F)
#fetch city and it's respective country from 'affiliation' column
cities_country <- lapply(gsub("\\[|\\]|[,;]|\\.","",df$affiliation), function(x)
paste(as.character(world.cities$name[sapply(world.cities$name, grepl, x, ignore.case=T)]),
as.character(world.cities$country.etc[sapply(world.cities$name, grepl, x, ignore.case=T)]),
sep="_"))
df$cities_country <- lapply(cities_country, function(x) if(identical(x, character(0))) NA_character_ else x)
df
输出是:
affiliation
1 [SCOTT, ALLEN J.] UNIV CALIF LOS ANGELES, DEPT GEOG, LOS ANGELES, CA 90095 USA
2 [VAN DUINEN, RIANNE; VAN DER VEEN, ANNE] UNIV TWENTE, DEPT WATER ENGN & MANAGEMENT, DRIENERLOLAAN 5,POB 217, NL-7500 AE ENSCHEDE, NETHERLANDS.
3 [ANANTSUKSOMSRI, SUTEE] CHULALONGKORN UNIV, FAC ARCHITECTURE, BANGKOK, THAILAND.
4 Prem
cities_country
1 Al_Norway, Alle_Switzerland, Allen_Philippines, Allen_USA, Angeles_Costa Rica, Angeles_Philippines, Cali_Colombia, Cot_Costa Rica, Li_Norway, Los Angeles_Chile, Los Angeles_USA, Os_Kyrgyzstan, Os_Norway, U_Micronesia, Usa_Japan
2 Ae_Marshall Islands, Ede_Netherlands, Ede_Nigeria, Enschede_Netherlands, Hede_China, Ine_Marshall Islands, Laa_Austria, Lola_Guinea, Man_Ivory Coast, Mana_French Guiana, Manage_Belgium, Nagem_Luxembourg, Ob_Russia, Ola_Panama, Po_Burkina Faso, U_Micronesia, Van_Turkey, Wa_Ghana, We_New Caledonia
3 Aila_Estonia, Al_Norway, Anan_Japan, Ba_Fiji, Bangkok_Thailand, Hit_Iraq, Ila_Nigeria, Ilan_Taiwan, Long_Thailand, Nan_Thailand, Tsu_Japan, U_Micronesia, Ula_Turkey
4 NA
(注意,在上面的输出中,我保留了所有出现的城市,为方便起见,还将它们与各自的国家一起加上)
答案 1 :(得分:1)
从您显示的几行开始看起来您可以执行以下操作(请注意您错过了对齐外壳):
tmpVec <- sapply(strsplit(db$affiliation, split = ","), function(x) {
cleanVec <- toupper(trimws(x))
cleanVec[max(which(cleanVec %in% toupper(maps::world.cities$name)))]
})
或者在函数中加入更多代码以避免丑陋的警告。
答案 2 :(得分:1)
让我留下解决方案的一部分。据我自己的研究可以看出,方括号中的字母似乎表示个人姓名。例如,我发现Sutee Anantsuksomsri
是一个实际名称。这一观察结果表明我们可能希望删除括号中的文本。
删除方括号中的文本后,我使用tidytext包中的unnest_tokens()
分割单词。请注意,该函数将所有字母转换为小写字母。如果您不喜欢它,可以通过指定to_lower = FALSE
来更改它。首先,我将每个城市名称分成单词。我还为每个城市分配了一个身份证号码。其次,我清理了你的数据。正如我之前所说,我使用gsub()
删除了方括号中的文本。然后,我将unnest_tokens()
应用于数据。我使用cities
中filter()
中的单词来对单词进行子集化。我们得出的结果如下。显然,你还有很多工作要做。我保留下面的采样数据mydf
。我希望你能从这里继续前进。
data(world.cities)
cities <- world.cities %>%
mutate(id = 1:n()) %>%
unnest_tokens(input = name, output = word, token = "words")
temp <- mydf %>%
mutate(affiliation = gsub(x = affiliation, pattern = "\\[.*\\]", replacement = "")) %>%
unnest_tokens(input = affiliation, output = word, token = "words") %>%
filter(word %in% cities$word)
id word
1 1 los
2 1 angeles
3 1 los
4 1 angeles
5 1 ca
6 1 usa
7 2 water
8 2 ae
9 2 enschede
10 3 bangkok
数据强>
mydf <- structure(list(id = 1:3, affiliation = c("[SCOTT, ALLEN J.] UNIV CALIF LOS ANGELES, DEPT GEOG, LOS ANGELES, CA 90095 USA",
"[VAN DUINEN, RIANNE; VAN DER VEEN, ANNE] UNIV TWENTE, DEPT WATER ENGN & MANAGEMENT, DRIENERLOLAAN 5,POB 217, NL-7500 AE ENSCHEDE, NETHERLANDS.",
"[ANANTSUKSOMSRI, SUTEE] CHULALONGKORN UNIV, FAC ARCHITECTURE, BANGKOK, THAILAND."
)), .Names = c("id", "affiliation"), row.names = c(NA, -3L), class = "data.frame")