I have a data frame with the columns name, surname, specific_value and random_value.
In it, some names are in the surname column and vice versa. What is the best way to find these kinds of switched values and switch them back? If there is no other way, the specific_value column may be of some help, but it is not necessarily unique across different names/surnames.
Edit: The right order is the one that has more occurrences. In this example, Luke is a name, because it appears in that column twice. If the occurrences are equal, you can't tell. But in general, the right order should always occur more often (the wrong one will be about 1 in 13 or something similar).
Edit 2: There are two more problems that I forgot to mention. First, my data has more than 3 million rows. Second, I also need to rely on specific_value, because it is possible that someone's name is Skywalker and his surname is Luke. That person differs from the other by his specific_value.
Answer 0 (score: 2)
Here's a way to achieve this with tidyverse.
library(tidyverse)
library(stringr)
df <- tribble(
  ~name,       ~surname,     ~random_value,
  "Luke",      "Skywalker",  1L,
  "Luke",      "Skywalker",  2L,
  "Skywalker", "Luke",       3L,
  "Leia",      "Organa",     4L,
  "Han",       "Solo",       5L,
  "Organa",    "Leia",       6L,
  "Ben",       "Solo",       7L,
  "Lando",     "Calrissian", 8L
)
First create a unique ID for each person by combining their two names alphabetically (unique_name).
df_with_id <- df %>%
  mutate(
    # Sort the two names so that both orderings produce the same key
    unique_name = map2_chr(name, surname, ~{
      str_sort(c(.x, .y)) %>% str_c(collapse = " ")
    })
  )
Then, for each person (unique_name), count the occurrences of each ordering of first and last name, and keep the most commonly occurring full name (chosen at random if all orderings are equally common).
Now you have a reference table that you can merge back onto your original data. So just drop the original name columns, and merge the new most common ones back in.
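The counting and merging steps described above can be sketched as follows. This is only a minimal sketch under stated assumptions, not necessarily the code the answer originally showed: the names name_counts, n_occurrences, reference and df_fixed are hypothetical, pmin()/pmax() stand in for the str_sort() step so the key is built without a per-row loop, and ties are broken by taking the first row rather than at random.

```r
library(dplyr)

df <- tibble::tribble(
  ~name,       ~surname,     ~random_value,
  "Luke",      "Skywalker",  1L,
  "Luke",      "Skywalker",  2L,
  "Skywalker", "Luke",       3L,
  "Leia",      "Organa",     4L,
  "Han",       "Solo",       5L,
  "Organa",    "Leia",       6L,
  "Ben",       "Solo",       7L,
  "Lando",     "Calrissian", 8L
)

# Order-independent key: both orderings of a person map to the same string.
# (Assumption: pmin()/pmax() replace the str_sort() call used in the answer.)
df_with_id <- df %>%
  mutate(unique_name = paste(pmin(name, surname), pmax(name, surname)))

# Count how often each concrete ordering occurs for each person.
name_counts <- df_with_id %>%
  group_by(unique_name, name, surname) %>%
  summarise(n_occurrences = n(), .groups = "drop")

# Reference table: keep the most frequent ordering per person.
# with_ties = FALSE keeps exactly one row, so ties are resolved arbitrarily.
reference <- name_counts %>%
  group_by(unique_name) %>%
  slice_max(n_occurrences, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(unique_name, name, surname)

# Drop the original name columns and merge the most common ordering back in.
df_fixed <- df_with_id %>%
  select(-name, -surname) %>%
  left_join(reference, by = "unique_name") %>%
  select(name, surname, random_value)
```

If people must also be told apart by specific_value, as the question's second edit asks, that column could be added to both group_by() calls and to the join key; the df used in this answer does not carry it.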
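Answer 1 (score: 1)
Here is an approach using dplyr. First, we create a new name called min_max, which should be the same regardless of the name/surname order. Then we create full_name_1, which pastes the names together in surname, name order. Then we count by full_name_1 and min_max. Finally, we create the new names by comparing count with max(count). If they match, the names stay as they are; otherwise, they are swapped.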
library(dplyr)
dat %>%
  rowwise() %>%
  # min_max is identical for both orderings of the same two names
  mutate(min_max = paste0(max(c(name, surname)),
                          ", ",
                          min(c(name, surname))),
         full_name_1 = paste0(surname, ", ", name)) %>%
  # count how often each concrete ordering occurs
  group_by(full_name_1, min_max) %>%
  mutate(count = n()) %>%
  # within each person, keep the most frequent ordering and swap the rest
  group_by(min_max) %>%
  mutate(name_2 = ifelse(count == max(count),
                         name, surname),
         surname_2 = ifelse(count == max(count),
                            surname, name)) %>%
  ungroup() %>%
  select(-min_max, -full_name_1, -count,
         -name, -surname)
#   specific_value random_value name_2 surname_2
#            <dbl>        <dbl> <chr>  <chr>
# 1              1            1 Luke   Skywalker
# 2              1            2 Luke   Skywalker
# 3              1            3 Luke   Skywalker
# 4              2            4 Leia   Organa
# 5              3            5 Han    Solo
# 6              2            6 Organa Leia
# 7              1            7 Ben    Solo
# 8              5            8 Lando  Calrissian
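Since the question mentions more than 3 million rows and the need to distinguish people by specific_value, a vectorized variant may be worth trying. The following is a sketch under assumptions, not a tested replacement for the code above: the sample dat is reconstructed from the names in the first answer and the specific_value column of the output shown here, pmin()/pmax() replace the rowwise() step, and the helper column swap and the result name fixed are hypothetical.

```r
library(dplyr)

# Assumed sample data, reconstructed from the output shown above.
dat <- tibble::tribble(
  ~name,       ~surname,     ~specific_value, ~random_value,
  "Luke",      "Skywalker",  1,               1,
  "Luke",      "Skywalker",  1,               2,
  "Skywalker", "Luke",       1,               3,
  "Leia",      "Organa",     2,               4,
  "Han",       "Solo",       3,               5,
  "Organa",    "Leia",       2,               6,
  "Ben",       "Solo",       1,               7,
  "Lando",     "Calrissian", 5,               8
)

fixed <- dat %>%
  # pmax()/pmin() work element-wise on character vectors, so rowwise() is not needed
  mutate(min_max = paste(pmax(name, surname), pmin(name, surname), sep = ", ")) %>%
  # count each concrete ordering, keeping different specific_value apart
  group_by(specific_value, min_max, name, surname) %>%
  mutate(count = n()) %>%
  # within one person, swap every ordering that is rarer than the most common one
  group_by(specific_value, min_max) %>%
  mutate(
    swap = count < max(count),
    name_2 = if_else(swap, surname, name),
    surname_2 = if_else(swap, name, surname)
  ) %>%
  ungroup() %>%
  select(specific_value, random_value, name_2, surname_2)
```

As in the answer's own output, a tie such as Leia Organa versus Organa Leia is left unchanged, because the counts alone cannot tell which ordering is right.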