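Separating a column in a data frame where each observation can have multiple concurrent values

Time: 2016-04-25 03:12:21

Tags: r dplyr tidyr

I believe my question is as much about best practice as it is about tidying messy data, so here goes.

Below is an excerpt of the data frame lang.df, a school-wide student dataset. The column Language.Home records the parents' response to the question: "What language do you speak at home?"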

> lang.df
   Nationality             Language.Home
1           HK                  Mandarin
2       German   Mandarin/English/German
3        Saudi                    Arabic
4    Norwegian                 Norwegian
5           UK                   English
6           HK Mandarin/ Min Nan dialect
7   Australian                  Mandarin
8           HK                  Mandarin
9    Brazilian        Portuguese/English
10      Indian             Hindi/English

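It is clear to me that this is a poor way to collect this information and a poor way to store it, but my job is to work with the data I have.

Outcome

I would like to explore the effect that certain home languages may have on achievement. What I need is to be able to group by a single language spoken at home (e.g. students who speak English at home).

To do this, it seems I must:

  1. Split the Language.Home column into three ("language.home1", "language.home2", "language.home3") using tidyr's separate().
  2. Create a new column for each unique value (i.e. language) in the new columns I created.

Process

Here is my attempt at doing the above efficiently: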

library(dplyr)
library(tidyr)

# separate Language.Home into three new columns
lang.df <- lang.df %>% separate(Language.Home,
        c("language.home1", "language.home2", "language.home3"),
        sep = "/",
        remove = FALSE)

#find distinct languages & remove NAs
langs <- unique(c(lang.df$language.home1,
    lang.df$language.home2,
    lang.df$language.home3))
langs <- langs[!is.na(langs)]

#create boolean column for each unique language in new columns
for (i in langs) {
  lang.df[, i] <- grepl(i, lang.df$Language.Home)
}

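Questions

  1. What is this situation called? I tried looking through the tidyr documentation and searching here, but found nothing relevant.
  2. Is there a more elegant way to code the transformation than mine?
  3. What is the best practice for
    • collecting this data (to change future data-entry processes)
    • handling this situation from a statistical perspective

Thanks in advance for your help. I have only been using R on and off for about a year, and this is my first SO post. Give me as much feedback as you can!

Data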

    lang.df <- structure(list(Nationality = structure(c(4L, 3L, 7L, 6L, 8L, 
    4L, 1L, 4L, 2L, 5L), .Label = c("Australian", "Brazilian", "German", 
    "HK", "Indian", "Norwegian", "Saudi", "UK"), class = "factor"), 
    `Language.Home` = structure(c(4L, 6L, 1L, 7L, 2L, 5L, 4L, 
    4L, 8L, 3L), .Label = c("Arabic", "English", "Hindi/English", 
    "Mandarin", "Mandarin/ Min Nan dialect", "Mandarin/English/German", 
    "Norwegian", "Portuguese/English"), class = "factor")), row.names = c(NA, 
    10L), .Names = c("Nationality", "Language.Home"), class = "data.frame")
    

3 answers:

Answer 0 (score: 5)

We can use cSplit from splitstackshape to split 'Language.Home' on the delimiter / and convert it to long format.

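For readers without splitstackshape, the long format described above can be illustrated in base R. This is only a sketch on a four-row subset of the question's data, not the answer's own code; `pieces` and `long` are names made up for the example.

```r
# Four rows copied from the question's excerpt (illustrative subset only)
lang.df <- data.frame(
  Nationality   = c("HK", "German", "HK", "Brazilian"),
  Language.Home = c("Mandarin", "Mandarin/English/German",
                    "Mandarin/ Min Nan dialect", "Portuguese/English"),
  stringsAsFactors = FALSE
)

# Split on "/" plus any whitespace that follows it; one list element per row
pieces <- strsplit(lang.df$Language.Home, "/\\s*")

# Repeat each Nationality once per language: one row per (student, language)
long <- data.frame(
  Nationality   = rep(lang.df$Nationality, lengths(pieces)),
  Language.Home = unlist(pieces),
  stringsAsFactors = FALSE
)
long
```

Each student now contributes as many rows as languages listed, which is the shape the later long-to-wide step works from.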

Then convert the result from 'long' to 'wide'.

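One way to picture the long-to-wide conversion is as a cross-tabulation of Nationality against language. The base R sketch below is an illustration under that assumption, not the answer's own code; the `long` data frame is a hand-written example and `wide` is a hypothetical name.

```r
# A small long-format example: one row per (student, language)
long <- data.frame(
  Nationality   = c("HK", "German", "German", "German", "HK", "HK"),
  Language.Home = c("Mandarin", "Mandarin", "English", "German",
                    "Mandarin", "Min Nan dialect"),
  stringsAsFactors = FALSE
)

# Count rows for every Nationality/language pair, then keep only
# whether the pair occurs at all (TRUE) or never (FALSE)
wide <- unclass(table(long$Nationality, long$Language.Home)) > 0
wide
```

Because the rows are keyed by Nationality, the two 'HK' students collapse into a single row here.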

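Note: there are duplicate 'Nationality' rows, so the above groups the common elements together. It may be better to keep them grouped.

If we need logical columns based on each row (irrespective of similar 'Nationality'):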

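A per-row version can be sketched by keying the cross-tabulation on the row number instead of Nationality. This is a base R illustration with made-up helper names (`pieces`, `row_id`, `flags`, `out`), not the answer's own code.

```r
# Three rows copied from the question's excerpt (illustrative subset only)
lang.df <- data.frame(
  Nationality   = c("HK", "German", "HK"),
  Language.Home = c("Mandarin", "Mandarin/English/German",
                    "Mandarin/ Min Nan dialect"),
  stringsAsFactors = FALSE
)

# Split each entry and remember which original row every piece came from
pieces <- strsplit(lang.df$Language.Home, "/\\s*")
row_id <- rep(seq_len(nrow(lang.df)), lengths(pieces))

# One logical column per language, one row per original observation
flags <- unclass(table(row_id, unlist(pieces))) > 0

# Attach the indicator columns to the original data
out <- cbind(lang.df, flags)
out
```

The two 'HK' rows stay separate, so each student keeps their own set of indicators.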

Other options are also available once 'Language.Home' has been split on /.

Answer 1 (score: 3)

A simple way to get the long form is tidyr::unnest()

library(dplyr)
library(tidyr)
library(stringr)

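lang.df %>% 
  # Language.Home is a factor, so convert it before splitting on "/"
  # (use "/\\s*" instead to also drop the space in "Mandarin/ Min Nan dialect")
  mutate(Language.Home = str_split(as.character(Language.Home), "/")) %>% 
  # expand the list-column into one row per language
  unnest(Language.Home)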
#>    Nationality    Language.Home
#> 1           HK         Mandarin
#> 2       German         Mandarin
#> 3       German          English
#> 4       German           German
#> 5        Saudi           Arabic
#> 6    Norwegian        Norwegian
#> 7           UK          English
#> 8           HK         Mandarin
#> 9           HK  Min Nan dialect
#> 10  Australian         Mandarin
#> 11          HK         Mandarin
#> 12   Brazilian       Portuguese
#> 13   Brazilian          English
#> 14      Indian            Hindi
#> 15      Indian          English
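
Once the data is in long form, grouping by a single home language becomes a plain filter or count. The sketch below uses base R on a hand-written long table; `counts` and `english` are hypothetical names, and the caveat in the comments is an assumption about how the data would be analysed.

```r
# A small long-format example: one row per (student, language)
long <- data.frame(
  Nationality   = c("HK", "German", "German", "German",
                    "UK", "Brazilian", "Brazilian"),
  Language.Home = c("Mandarin", "Mandarin", "English", "German",
                    "English", "Portuguese", "English"),
  stringsAsFactors = FALSE
)

# Number of students reporting each language at home
counts <- table(long$Language.Home)

# Students who speak English at home, whatever else they speak
english <- subset(long, Language.Home == "English")

# Caveat: a multilingual student appears in several language groups,
# so the groups overlap and their rows are not independent
counts
english
```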

Answer 2 (score: 2)

Here is a base R approach, only a few lines in total

lang.df <- structure(list(Nationality = structure(c(4L, 3L, 7L, 6L, 8L, 4L, 1L, 4L, 2L, 5L), .Label = c("Australian", "Brazilian", "German", "HK", "Indian", "Norwegian", "Saudi", "UK"), class = "factor"), `Language.Home` = structure(c(4L, 6L, 1L, 7L, 2L, 5L, 4L, 4L, 8L, 3L), .Label = c("Arabic", "English", "Hindi/English", "Mandarin", "Mandarin/ Min Nan dialect", "Mandarin/English/German", "Norwegian", "Portuguese/English"), class = "factor")), row.names = c(NA, 10L), .Names = c("Nationality", "Language.Home"), class = "data.frame")

Part two: a new data frame with each language split into its own column and labelled in order

dd <- read.table(text = gsub('/\\s*', ';', lang.df$Language.Home),
                 sep = ';', na.strings = '', fill = TRUE, as.is = TRUE,
                 col.names = paste0('lang.home', 1:3))
#      lang.home1      lang.home2 lang.home3
#   1    Mandarin            <NA>       <NA>
#   2    Mandarin         English     German
#   3      Arabic            <NA>       <NA>
#   4   Norwegian            <NA>       <NA>
#   5     English            <NA>       <NA>
#   6    Mandarin Min Nan dialect       <NA>
#   7    Mandarin            <NA>       <NA>
#   8    Mandarin            <NA>       <NA>
#   9  Portuguese         English       <NA>
#  10       Hindi         English       <NA>

Part three: logical indicators for each unique language

lang <- na.omit(sort(unique(unlist(dd))))
idx <- `colnames<-`(t(apply(dd, 1, function(x) lang %in% x)), lang)

#       Arabic English German Hindi Mandarin Min Nan dialect Norwegian Portuguese
#  [1,]  FALSE   FALSE  FALSE FALSE     TRUE           FALSE     FALSE      FALSE
#  [2,]  FALSE    TRUE   TRUE FALSE     TRUE           FALSE     FALSE      FALSE
#  [3,]   TRUE   FALSE  FALSE FALSE    FALSE           FALSE     FALSE      FALSE
#  [4,]  FALSE   FALSE  FALSE FALSE    FALSE           FALSE      TRUE      FALSE
#  [5,]  FALSE    TRUE  FALSE FALSE    FALSE           FALSE     FALSE      FALSE
#  [6,]  FALSE   FALSE  FALSE FALSE     TRUE            TRUE     FALSE      FALSE
#  [7,]  FALSE   FALSE  FALSE FALSE     TRUE           FALSE     FALSE      FALSE
#  [8,]  FALSE   FALSE  FALSE FALSE     TRUE           FALSE     FALSE      FALSE
#  [9,]  FALSE    TRUE  FALSE FALSE    FALSE           FALSE     FALSE       TRUE
# [10,]  FALSE    TRUE  FALSE  TRUE    FALSE           FALSE     FALSE      FALSE

Combining the three parts:

cbind(lang.df, dd, idx)

#    Nationality             Language.Home lang.home1      lang.home2 lang.home3 Arabic English German Hindi Mandarin Min Nan dialect Norwegian Portuguese
# 1           HK                  Mandarin   Mandarin            <NA>       <NA>  FALSE   FALSE  FALSE FALSE     TRUE           FALSE     FALSE      FALSE
# 2       German   Mandarin/English/German   Mandarin         English     German  FALSE    TRUE   TRUE FALSE     TRUE           FALSE     FALSE      FALSE
# 3        Saudi                    Arabic     Arabic            <NA>       <NA>   TRUE   FALSE  FALSE FALSE    FALSE           FALSE     FALSE      FALSE
# 4    Norwegian                 Norwegian  Norwegian            <NA>       <NA>  FALSE   FALSE  FALSE FALSE    FALSE           FALSE      TRUE      FALSE
# 5           UK                   English    English            <NA>       <NA>  FALSE    TRUE  FALSE FALSE    FALSE           FALSE     FALSE      FALSE
# 6           HK Mandarin/ Min Nan dialect   Mandarin Min Nan dialect       <NA>  FALSE   FALSE  FALSE FALSE     TRUE            TRUE     FALSE      FALSE
# 7   Australian                  Mandarin   Mandarin            <NA>       <NA>  FALSE   FALSE  FALSE FALSE     TRUE           FALSE     FALSE      FALSE
# 8           HK                  Mandarin   Mandarin            <NA>       <NA>  FALSE   FALSE  FALSE FALSE     TRUE           FALSE     FALSE      FALSE
# 9    Brazilian        Portuguese/English Portuguese         English       <NA>  FALSE    TRUE  FALSE FALSE    FALSE           FALSE     FALSE       TRUE
# 10      Indian             Hindi/English      Hindi         English       <NA>  FALSE    TRUE  FALSE  TRUE    FALSE           FALSE     FALSE      FALSE
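
With indicator columns like these, picking out one language group is a logical subset. The sketch below is a hand-built miniature of the combined table (the `wide` data frame and result names are assumptions for illustration); note that a column name containing spaces, such as "Min Nan dialect", has to be accessed with `[[ ]]` or backticks.

```r
# Miniature of the combined table: one row per student, one flag per language
wide <- data.frame(
  Nationality       = c("HK", "German", "UK", "HK"),
  English           = c(FALSE, TRUE, TRUE, FALSE),
  Mandarin          = c(TRUE, TRUE, FALSE, TRUE),
  `Min Nan dialect` = c(FALSE, FALSE, FALSE, TRUE),
  check.names = FALSE,       # keep the spaces in "Min Nan dialect"
  stringsAsFactors = FALSE
)

# Students who speak English at home
english_home <- wide[wide$English, ]

# Column names with spaces need [[ ]] (or backticks after $)
min_nan_home <- wide[wide[["Min Nan dialect"]], ]

# Number of students per language: TRUE counts as 1
n_per_language <- colSums(wide[, -1])
n_per_language
```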