Splitting a column in a data frame?

Time: 2015-09-23 14:43:39

Tags: r dataframe

My data frame contains data from IMDb. One column holds the movie title with the year appended in parentheses, for example: The Shawshank Redemption (1994)

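What I really want is to separate the title from the year. I tried a few different things (split, strsplit) without success. I tried splitting on the opening parenthesis, but both split functions seem to dislike non-character arguments. Does anyone have any ideas?

4 Answers:

Answer 0: (score: 7)

strsplit works on character columns. So if the column is of class factor, we need to convert it to character first (as.character(..)). Here the pattern matches zero or more whitespace characters (\\s*) followed by an opening parenthesis (\\(), or (|) a closing parenthesis (\\)), and splits on either match.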

strsplit(as.character(d1$v1), '\\s*\\(|\\)')[[1]]
#[1] "The Shawshank Redemption" "1994"         

Or we can put the parentheses inside [] so that we do not have to escape them with \\ (as commented by @Avinash Raj):

strsplit(as.character(d1$v1), '\\s*[()]')[[1]]

Data

v1 <- 'The Shawshank Redemption (1994)'
d1 <- data.frame(v1)
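
As a rough sketch of how the split result could be turned into two new columns for every row, the following uses base R only. The second movie title and the new column names title and year are assumptions added for illustration; they are not part of the original answer.

```r
# Hypothetical sample data; the column name v1 follows the answer above,
# the second row is added only for illustration
d1 <- data.frame(
  v1 = c("The Shawshank Redemption (1994)", "The Godfather (1972)"),
  stringsAsFactors = FALSE
)

# Split on optional whitespace plus "(" or on ")"; returns one vector per row
parts <- strsplit(as.character(d1$v1), "\\s*\\(|\\)")

# First piece of each vector is the title, second piece is the year
d1$title <- vapply(parts, `[`, character(1), 1)
d1$year  <- as.integer(vapply(parts, `[`, character(1), 2))
```

Using vapply with a declared return type makes the extraction fail loudly if a row does not produce a character piece, which can be easier to debug than a silently ragged result.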

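Answer 1: (score: 3)

If you want a precise split (i.e., split only on the last parenthesized chunk, the one holding digits at the end of the string), you can try this.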

x <- c("The Shawshank Redemption (1994)", "Kung(fu) Pa (23) nda (2010)")
strsplit(as.character(x), "\\s*\\((?=\\d+\\)$)|\\)$", perl = TRUE)
# [[1]]
# [1] "The Shawshank Redemption" "1994"                    

# [[2]]
# [1] "Kung(fu) Pa (23) nda" "2010"
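
The same idea of targeting only the trailing "(digits)" group can also be written with capture groups instead of a split. This is a sketch of an alternative technique, not the answer author's method; the variable names pattern, title, and year are assumptions.

```r
x <- c("The Shawshank Redemption (1994)", "Kung(fu) Pa (23) nda (2010)")

# Group 1: the title (lazy match); group 2: digits inside the final parentheses
pattern <- "^(.*?)\\s*\\((\\d+)\\)$"

# sub() replaces the whole match with the requested capture group
title <- sub(pattern, "\\1", x, perl = TRUE)
year  <- as.integer(sub(pattern, "\\2", x, perl = TRUE))
```

Note that sub() returns a string unchanged when the pattern does not match, so titles without a trailing year would need a separate check (for example with grepl) before the year is converted to an integer.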

Answer 2: (score: 2)

A tidyr solution

library(tidyr)
df %>% separate(col, c("name", "year"), "[()]")

Thanks to Avinash, I can take his regular expression and use it with tidyr:

m <- c("The Shawshank Redemption (1994)", "The Shawshank (Redemption) (1994)", "Kung(fu) Pa (23) nda (2010)")
m2 <- data.frame(m)
m2 %>% separate(m, c("name", "year"), "\\s*\\((?=\\d+\\)$)|\\)$")

                        name year
1   The Shawshank Redemption 1994
2 The Shawshank (Redemption) 1994
3       Kung(fu) Pa (23) nda 2010

Answer 3: (score: 0)

Try the following code:

t(sapply(strsplit(c("The Shawshank Redemption (1994)"), '\\s*\\(|\\)'),rbind))

The code above will work if you pass in only the data frame column that contains the titles.