How to switch name and surname in a data frame so they are in the right columns

Time: 2017-08-30 20:09:36

Tags: r dataframe

I have a dataframe that looks somewhat like this:

  specific_value random_value      name    surname
1              1            1      Luke  Skywalker
2              1            2      Luke  Skywalker
3              1            3 Skywalker       Luke
4              2            4      Leia     Organa
5              3            5       Han       Solo
6              2            6    Organa       Leia
7              1            7       Ben       Solo
8              5            8     Lando Calrissian

In it, some names are in the surname column and vice versa. What is the best way to find these kinds of switched values and switch them back? If there is no other way, the specific_value column can be of some help, but it is not necessarily unique across different names/surnames.

Edit: The right order is the one that has more occurrences. In this example, Luke is a name, because it appears in the name column twice. If the occurrences are equal, you can't tell. But in general, the right order should always occur more often (the wrong one will be about 1 in 13 or something similar).

Edit2: There are two more problems which I forgot to mention. The first is that my data is more than 3 million rows long. The second is that I also need to rely on specific_value, as it is possible that someone's name is Skywalker and his surname is Luke. That person differs from the other by his specific_value.

2 answers:

Answer 0 (score: 2)

Here's a way to achieve this with tidyverse:

library(tidyverse)
library(stringr)

df <- tribble(
  ~name,       ~surname,     ~random_value,
  "Luke",      "Skywalker",  1L,
  "Luke",      "Skywalker",  2L,
  "Skywalker", "Luke",       3L,
  "Leia",      "Organa",     4L,
  "Han",       "Solo",       5L,
  "Organa",    "Leia",       6L,
  "Ben",       "Solo",       7L,
  "Lando",     "Calrissian", 8L
)

First create a unique ID for each person by combining their two names alphabetically (unique_name).

df_with_id <- df %>%
  mutate(
    # sort the two names so both orderings map to the same key
    unique_name = map2_chr(name, surname, ~{
      str_sort(c(.x, .y)) %>% str_c(collapse = " ")
    })
  )

Then combine each person's first and last names, count the occurrences of each ordering of first/last names within each person (unique_name), and keep the most commonly occurring full name for each person (random if all are equally common).
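The counting step can be sketched roughly as follows. This is only an illustration, not necessarily the code the answer originally showed: the names name_lookup and times_seen are made up here, the key is built with base R's pmin()/pmax() instead of str_sort(), and ties are broken arbitrarily by slice_max() rather than at random.

```r
library(dplyr)

# Toy data copied from the example in this answer
df <- tibble::tribble(
  ~name,       ~surname,     ~random_value,
  "Luke",      "Skywalker",  1L,
  "Luke",      "Skywalker",  2L,
  "Skywalker", "Luke",       3L,
  "Leia",      "Organa",     4L,
  "Han",       "Solo",       5L,
  "Organa",    "Leia",       6L,
  "Ben",       "Solo",       7L,
  "Lando",     "Calrissian", 8L
)

# Order-independent key: the two names sorted alphabetically
# (assumption: vectorised pmin/pmax swapped in for the answer's str_sort)
df_with_id <- df %>%
  mutate(unique_name = paste(pmin(name, surname), pmax(name, surname)))

# Count how often each ordering occurs, then keep the most frequent
# ordering per person (hypothetical names: name_lookup, times_seen)
name_lookup <- df_with_id %>%
  group_by(unique_name, name, surname) %>%
  summarise(times_seen = n(), .groups = "drop") %>%
  group_by(unique_name) %>%
  slice_max(times_seen, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(unique_name, name, surname)
```

With this toy data, Luke/Skywalker appears twice in that order and once reversed, so the lookup keeps "Luke" as the name. For Leia/Organa the two orderings are tied, so whichever row slice_max() returns first is kept.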

Now you have a reference table that you can merge back onto your original data. So just drop the original name columns, and merge the new most common ones back in.
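A minimal sketch of that merge, using small stand-in tables so it runs on its own. The table name name_lookup and the result name df_fixed are assumptions made for illustration; the rows are taken from the example data.

```r
library(dplyr)

# Stand-in for the data with the order-independent key already attached
df_with_id <- tibble::tribble(
  ~name,       ~surname,    ~random_value, ~unique_name,
  "Luke",      "Skywalker", 1L,            "Luke Skywalker",
  "Skywalker", "Luke",      3L,            "Luke Skywalker",
  "Han",       "Solo",      5L,            "Han Solo"
)

# Stand-in for the reference table: one preferred ordering per person
name_lookup <- tibble::tribble(
  ~unique_name,     ~name,  ~surname,
  "Luke Skywalker", "Luke", "Skywalker",
  "Han Solo",       "Han",  "Solo"
)

# Drop the possibly swapped columns, then pull in the preferred ones by key
df_fixed <- df_with_id %>%
  select(-name, -surname) %>%
  left_join(name_lookup, by = "unique_name") %>%
  select(name, surname, random_value)
```

Because the join is on unique_name only, every row keeps its own random_value while the name and surname come from the lookup.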


Answer 1 (score: 1)

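Here is a way using dplyr. First, we create a new name called min_max, which should be the same regardless of the name/surname order. Then we create full_name_1, which pastes the names together in surname, name order. Then we count by full_name_1 and min_max. Finally, we create the new names by comparing the count with max(count). If they match, the names stay as they are; otherwise, they are swapped.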

dat %>%
  rowwise() %>%
  mutate(min_max = paste0(max(c(name, surname)), 
                          ", ",
                          min(c(name, surname))),
         full_name_1 = paste0(surname, ", ", name)) %>%
  group_by(full_name_1, min_max) %>%
  mutate(count = n()) %>%
  group_by(min_max) %>%
  mutate(name_2 = ifelse(count == max(count),
                           name, surname),
         surname_2 = ifelse(count == max(count),
                          surname, name)) %>%
  ungroup() %>%
  select(-min_max, -full_name_1, -count,
         -name, -surname)

#   specific_value random_value name_2  surname_2
#            <dbl>        <dbl>  <chr>      <chr>
# 1              1            1   Luke  Skywalker
# 2              1            2   Luke  Skywalker
# 3              1            3   Luke  Skywalker
# 4              2            4   Leia     Organa
# 5              3            5    Han       Solo
# 6              2            6 Organa       Leia
# 7              1            7    Ben       Solo
# 8              5            8  Lando Calrissian
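The question also asks for specific_value to be taken into account and mentions more than 3 million rows, where rowwise() can be slow. The following is a vectorised sketch of the same idea, not a tested solution for the real data. It assumes that a person is identified by the pair of names plus specific_value; the input is rebuilt from the example rows and the output above, and the helper columns key, n_order and swap are made up for illustration.

```r
library(dplyr)

# Example input, rebuilt from the sample rows and the output shown above
dat <- tibble::tribble(
  ~specific_value, ~random_value, ~name,       ~surname,
  1,               1,             "Luke",      "Skywalker",
  1,               2,             "Luke",      "Skywalker",
  1,               3,             "Skywalker", "Luke",
  2,               4,             "Leia",      "Organa",
  3,               5,             "Han",       "Solo",
  2,               6,             "Organa",    "Leia",
  1,               7,             "Ben",       "Solo",
  5,               8,             "Lando",     "Calrissian"
)

fixed <- dat %>%
  # order-independent key, computed without rowwise()
  mutate(key = paste(pmin(name, surname), pmax(name, surname))) %>%
  # how often this exact ordering occurs for this person
  group_by(specific_value, key, name) %>%
  mutate(n_order = n()) %>%
  # a row is swapped if another ordering of the same person is more frequent
  group_by(specific_value, key) %>%
  mutate(swap = n_order < max(n_order)) %>%
  ungroup() %>%
  mutate(name_2    = if_else(swap, surname, name),
         surname_2 = if_else(swap, name, surname)) %>%
  select(specific_value, random_value, name_2, surname_2)
```

As in the answer's code, tied orderings (Leia/Organa here) are left unchanged. A Skywalker Luke with a different specific_value would land in a separate group and would not be swapped.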