Question

I have two data frames, each with two columns:

event: column names -> "District", "City"
villages: column names -> "District", "Village"

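The "event" data frame has the district name and the events that occurred in villages of that district. The "City" column holds free-text sentences in which the village name of the respective district is hidden.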

data in event:
    District    City
    Dst1        Dance program occured in village1 near highway
    Dst1        Regional gatherting atvillage2  --> note: "at" and "village2" are run together (typo); partial search required!
    Dst2        Music showsin village3with famous songs --> note: I am not concerned about "showsin", but "village3with" contains the village3 name
    Dst3        Sunset is pretty nice in these area
    .
    .
    .
    .
    .

The "villages" data frame has the district name and the complete list of villages under each district.

data in villages:
    District    Village
    Dst1        Village1
    Dst1        Village2
    Dst2        Village1
    .
    .
    .
    .
    .

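Requirements:

I have to extract the village name from the event$City column and check whether it exists in the "villages" data frame. Example: the first 3 rows of event$City have village names hidden inside the sentences; I have to extract them and cross-check them against the "villages" data frame.
If no village name is found, use the district name from the "event" data frame.
Also, within the same district, several village names start with the same prefix, so a partial search is required.
Finally, I will have to find the lat-long geocodes for these villages.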
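
The first three requirements can be sketched in base R as follows. This is a minimal sketch under assumptions, not a tested solution for the real data: the toy data frames reuse the `Dst`/`Village` placeholders from the example (with `Village3` assumed to be listed under `Dst2`), and `find_village` is a hypothetical helper name. It does a case-insensitive substring search, so it copes with run-together words such as "atvillage2" but not with misspelled village names, and it does not cover geocoding.

```r
# Toy data mirroring the example above (column names are assumptions).
event <- data.frame(
  District = c("Dst1", "Dst1", "Dst2", "Dst3"),
  City = c("Dance program occured in village1 near highway",
           "Regional gatherting atvillage2",
           "Music showsin village3with famous songs",
           "Sunset is pretty nice in these area"),
  stringsAsFactors = FALSE
)
villages <- data.frame(
  District = c("Dst1", "Dst1", "Dst2"),
  Village  = c("Village1", "Village2", "Village3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the first village of the district whose name
# occurs anywhere in the sentence, or the district name if none does.
find_village <- function(district, sentence, villages) {
  candidates <- villages$Village[villages$District == district]
  # Try longer names first so that "Village10" would win over "Village1".
  candidates <- candidates[order(nchar(candidates), decreasing = TRUE)]
  hit <- vapply(candidates, function(v) {
    grepl(tolower(v), tolower(sentence), fixed = TRUE)
  }, logical(1))
  if (any(hit)) candidates[which(hit)[1]] else district
}

event$Village <- mapply(find_village, event$District, event$City,
                        MoreArgs = list(villages = villages),
                        USE.NAMES = FALSE)
print(event)
```

Sorting the candidates by length is one simple way to deal with village names that share a prefix; other tie-breaking rules may suit the real data better.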

Final desired output (a "Village" column must be added to "event"):

event:
    District    City                                                Village
--------------------------------------------------------------------------------------------
    Dst1        Dance program occured in village1 near highway      Village1 --> present in df2
    Dst1        Regional gatherting atvillage2                      Village2 --> present in df2
    Dst2        Music showsin village3with famous songs             Village3 --> present in df2
    Dst3        Sunset is pretty nice in these area                 Dst3 --> no village name mentioned and not found in df2

I tried the R code below.

I tried using grepl and pmatch but am not satisfied with the results. Can anyone help me tweak the code to get more accurate results?

library(dplyr)    # filter()
library(stringr)  # str_replace_all()
Trim <- trimws    # assumption: Trim() in the original strips surrounding whitespace

count1 <- 0  # rows where a village name was found
count2 <- 0  # rows that fell back to the district name
event$Village <- NA_character_

for (i in seq_along(event$City)) {
  found <- FALSE
  district_i <- Trim(as.character(event$District[i]))
  if (district_i != "" && district_i != "fatehgarh") {
    # Pick all the villages from a district
    village <- filter(villages, DISTRICT.NAME == district_i)
    # Village column has the district name as well; just drop it
    data1 <- Trim(gsub(district_i, "", as.character(village$Area.Name), fixed = TRUE))
    data1 <- data1[data1 != ""]

    # Remove any unwanted special chars from the City sentences
    text <- str_replace_all(as.character(event$City[i]), "[[:punct:]]", "")
    print(paste0("i is : ", i))

    # Loop through each village
    for (y in data1) {
      # Look for the village name inside the sentence (not the sentence words
      # inside the village name), ignoring case
      bool <- grepl(tolower(y), tolower(text), fixed = TRUE)
      if (bool) {
        event$Village[i] <- y
        found <- TRUE
        count1 <- count1 + 1
        print(paste0("Village - ", y))
        break
      }
    }
    if (!found) {
      # No village matched: fall back to the district name
      event$Village[i] <- district_i
      count2 <- count2 + 1
    }
  }
}

Sample data frames:

city=c("Regional priests vising **basavapattana**in karnataka",
       "Showman of bollywood visitedhonnali lastweek",
       "Reliance has launched jio tower in chennagiri",
       "Railway track damaged near sargara",
       "Electioncommittee passed strict order to SorabaMLA",
       "Rural villages are now provided with WiFi facility"
       )

district = c("Davanagere",  "Shimoga",  "Shimoga",  "Shimoga",  "Shimoga",  "Dharwad")
event = data.frame(city, district)

district=c("Davanagere",  "Davanagere",  "Davanagere",  "Davanagere",  "Davanagere",  "Davanagere",  "Davanagere",  "Davanagere",  "Davanagere",  "Davanagere",  "Shimoga",  "Shimoga",  "Shimoga",  "Shimoga",  "Shimoga")
villages=c("Arasapura",  "Attigere",  "Avaragolla",  "Bada",  "Balluru",  "Basavanahalu",  "**Basavapatna**",  "Batlekatte",  "Bavihalu",  "Belavanur",  "Hosanagara",  "Sagar",  "Shikarpur",  "Sorab",  "Tirthahalli")

villages = data.frame(district, villages)

Also, if you look closely, the bold (double-starred) names refer to the same village; one of them is just misspelled. It must still pick Basavapatna.
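
One way to tolerate misspellings like this is approximate substring matching with base R's `adist(..., partial = TRUE)`, which gives the edit distance between a name and its best-matching substring of the sentence. The following is a sketch under assumptions rather than a tuned solution: `fuzzy_village` and the `max_rel_dist = 0.2` cutoff (edits allowed per character of the village name) are hypothetical choices, and the `**` markers are left out of the village list.

```r
# Sample data from the question; "**" is kept in the sentence only.
event <- data.frame(
  city = c("Regional priests vising **basavapattana**in karnataka",
           "Showman of bollywood visitedhonnali lastweek",
           "Reliance has launched jio tower in chennagiri",
           "Railway track damaged near sargara",
           "Electioncommittee passed strict order to SorabaMLA",
           "Rural villages are now provided with WiFi facility"),
  district = c("Davanagere", "Shimoga", "Shimoga", "Shimoga", "Shimoga", "Dharwad"),
  stringsAsFactors = FALSE
)
villages <- data.frame(
  district = c(rep("Davanagere", 10), rep("Shimoga", 5)),
  villages = c("Arasapura", "Attigere", "Avaragolla", "Bada", "Balluru",
               "Basavanahalu", "Basavapatna", "Batlekatte", "Bavihalu",
               "Belavanur", "Hosanagara", "Sagar", "Shikarpur", "Sorab",
               "Tirthahalli"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pick the village of the district that is closest, in
# edit distance per character, to some substring of the sentence.
fuzzy_village <- function(district, sentence, villages, max_rel_dist = 0.2) {
  candidates <- villages$villages[villages$district == district]
  if (length(candidates) == 0) return(district)
  # Strip punctuation (including "**") and lower-case both sides.
  clean      <- tolower(gsub("[[:punct:]]", "", sentence))
  cand_clean <- tolower(gsub("[[:punct:]]", "", candidates))
  # One distance per candidate name against the whole sentence.
  d   <- as.vector(adist(cand_clean, clean, partial = TRUE))
  rel <- d / nchar(cand_clean)
  best <- which.min(rel)
  # Accept the best candidate only if it is close enough; else use the district.
  if (rel[best] <= max_rel_dist) candidates[best] else district
}

event$village <- mapply(fuzzy_village, event$district, event$city,
                        MoreArgs = list(villages = villages),
                        USE.NAMES = FALSE)
print(event[, c("district", "village")])
```

Scaling the distance by the name length keeps short names such as "Bada" from matching on a single stray edit. The cutoff is a guess and would need tuning against real data.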

Also, the data volume is very large, so I need some help with optimization as well.
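
On the optimization point, one possible direction is to group the village names by district once and run each vectorized `grepl` over all sentences of that district, instead of filtering the village table for every event row. The sketch below assumes the toy `Dst`/`Village` data from the example (with `Village3` assumed under `Dst2`) and exact, case-insensitive substring matching; it has not been benchmarked on large data.

```r
# Toy data mirroring the example (column names are assumptions).
event <- data.frame(
  District = c("Dst1", "Dst1", "Dst2", "Dst3"),
  City = c("Dance program occured in village1 near highway",
           "Regional gatherting atvillage2",
           "Music showsin village3with famous songs",
           "Sunset is pretty nice in these area"),
  stringsAsFactors = FALSE
)
villages <- data.frame(
  District = c("Dst1", "Dst1", "Dst2"),
  Village  = c("Village1", "Village2", "Village3"),
  stringsAsFactors = FALSE
)

# Build the per-district village lists once.
village_lists <- split(villages$Village, villages$District)

event$Village <- event$District          # default: fall back to the district
assigned  <- rep(FALSE, nrow(event))     # rows that already have a village
sentences <- tolower(event$City)         # lower-case once, not per comparison

for (d in names(village_lists)) {
  rows <- which(event$District == d)     # all event rows of this district
  for (v in village_lists[[d]]) {
    # One vectorized search per village over every sentence of the district.
    hit  <- grepl(tolower(v), sentences[rows], fixed = TRUE)
    take <- rows[hit & !assigned[rows]]
    event$Village[take] <- v
    assigned[take] <- TRUE
  }
}
print(event$Village)
```

The number of `grepl` calls then depends on the number of villages rather than on the number of event rows, which tends to matter when events far outnumber villages.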

Extract a specific word and geocode it

0 Answers: