R: Why am I losing data with spread()?

Time: 2019-02-26 04:36:46

Tags: r dplyr data-cleaning spread

I have a tibble like this.

# A tibble: 1,000 x 3
   id    question                                   answer
   <chr> <chr>                                      <chr>
 1 aaa   What is your favorite color?               Green
 2 aaa   What is your favorite band?                Green Day
 3 aaabb What is your favorite color?               Blue
 4 aaabb What is your favorite band?                Blue
 5 ccc   What is your favorite color?               Blue
 6 ccc   What is the difference between you and me? Five bank accounts
# ... with more rows

I want to spread it into a wide data frame. I used this code.

aTibble %>% distinct() %>%  spread(question, answer)

However, what I end up with is a data frame filled with empty rows.

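In the original tibble, some rows have an ID but null values for both question and answer. No single ID has a duplicated question. That said, different IDs may have answered different questions; their question sets are not identical.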
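A minimal sketch of that situation on made-up data may help when comparing against the real data set. The tibble `toy`, the id "ddd", and all values are hypothetical. It only illustrates that an ID whose question and answer are both NA still gets its own row after spread(), and that this row is empty in every real question column; whether this explains the full data set is an assumption.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: id "ddd" has neither a question nor an answer.
toy <- tibble(
  id       = c("aaa", "aaa", "ddd"),
  question = c("What is your favorite color?", "What is your favorite band?", NA),
  answer   = c("Green", "Green Day", NA)
)

# Same pipeline as in the question.
wide <- toy %>% distinct() %>% spread(question, answer)

# Count the rows whose question is missing before spreading.
n_missing <- sum(is.na(toy$question))

print(wide)
print(n_missing)
```

Running the same `sum(is.na(...))` check on the full tibble would show how many such rows exist there.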

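Also, I get a V1 column that was not in my original tibble. It appears after spread().

Frustratingly, when I run this on a small data set, it works fine. When I run it on the full data set (about 150,000 records), I get NAs.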

1 Answer:

Answer 0: (score: 2)

It is hard to see why this does not work. dcast from reshape2 is a good alternative; you can achieve the same thing with it.

aTibble %>% distinct() %>% dcast(id ~ question, value.var = "answer")
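For completeness, here is a self-contained sketch of the dcast call above on made-up data. The tibble `toy` and its values are hypothetical, and the extra `filter()` step is an assumption on my part: it drops the rows that have an ID but no question, in case those are what produce the empty rows. The question does not confirm that this is the cause.

```r
library(dplyr)
library(reshape2)

# Hypothetical toy data: id "ddd" has neither a question nor an answer.
toy <- tibble(
  id       = c("aaa", "aaa", "ccc", "ddd"),
  question = c("What is your favorite color?", "What is your favorite band?",
               "What is your favorite color?", NA),
  answer   = c("Green", "Green Day", "Blue", NA)
)

wide <- toy %>%
  distinct() %>%
  filter(!is.na(question)) %>%                # assumption: drop rows without a question
  dcast(id ~ question, value.var = "answer")  # one row per id, one column per question

print(wide)
```

IDs that did not answer a given question get NA in that column, which is expected when different IDs answered different questions.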