Question

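I have a data frame containing data from IMDb. One of the columns holds the movie title with the year appended in parentheses. It looks like this: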

The Shawshank Redemption (1994)

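What I would really like is to separate the title from the year. I have tried a few different things (split, strsplit) without success. I tried splitting on the first parenthesis, but both split functions seem not to like non-character arguments. Does anyone have any ideas?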

Answer 1

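strsplit works on character columns. So if the column is of class factor, we need to convert it to character (as.character(..)). Here, I match zero or more spaces (\\s*) followed by an opening parenthesis (\\(), or (|) a closing parenthesis (\\)), and split on that.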

strsplit(as.character(d1$v1), '\\s*\\(|\\)')[[1]]
#[1] "The Shawshank Redemption" "1994"

Or we can place the parentheses inside [] so that we don't have to escape them with \\ (as commented by @Avinash Raj):

strsplit(as.character(d1$v1), '\\s*[()]')[[1]]

Data

v1 <- 'The Shawshank Redemption (1994)'
d1 <- data.frame(v1)
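
The example above splits only the first element. As a minimal sketch of how the same idea might be applied to a whole column, the code below builds a small data frame (the second title is hypothetical sample data, and the column is made a factor on purpose), shows that strsplit rejects the factor, and then stacks the pieces into new title and year columns. The names parts, m, title and year are illustrative assumptions, not part of the original answer.

```r
# Hypothetical sample data; the column is a factor on purpose
d1 <- data.frame(v1 = c("The Shawshank Redemption (1994)",
                        "The Godfather (1972)"),
                 stringsAsFactors = TRUE)

# strsplit() refuses a factor ("non-character argument")
failed <- inherits(try(strsplit(d1$v1, "\\s*[()]"), silent = TRUE),
                   "try-error")

# Convert to character first, then split on optional spaces plus a parenthesis
parts <- strsplit(as.character(d1$v1), "\\s*[()]")

# Each list element is c(title, year); stack them into a two-column matrix
m <- do.call(rbind, parts)

d1$title <- m[, 1]
d1$year  <- as.integer(m[, 2])
print(d1)
```

This assumes every row has exactly one parenthesized year at the end; rows with extra parentheses would yield pieces of different lengths, and rbind would recycle them with a warning.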

Answer 2

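In case you want an exact split (i.e., splitting only on the parenthesized chunk at the very end of the string), you could try this.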

x <- c("The Shawshank Redemption (1994)", "Kung(fu) Pa (23) nda (2010)")
strsplit(as.character(x), "\\s*\\((?=\\d+\\)$)|\\)$", perl=TRUE)
# [[1]]
# [1] "The Shawshank Redemption" "1994"                    

# [[2]]
# [1] "Kung(fu) Pa (23) nda" "2010"
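
For comparison, here is a sketch of a different technique that reaches the same result without a lookahead: an anchored pattern with two capture groups, applied with sub. This is not what the answer uses; the names pattern, title and year are assumptions introduced for illustration.

```r
x <- c("The Shawshank Redemption (1994)", "Kung(fu) Pa (23) nda (2010)")

# Group 1: the title (lazy, so trailing spaces go to \\s*)
# Group 2: the digits inside the final pair of parentheses
pattern <- "^(.*?)\\s*\\((\\d+)\\)$"

title <- sub(pattern, "\\1", x, perl = TRUE)
year  <- as.integer(sub(pattern, "\\2", x, perl = TRUE))

print(data.frame(title, year))
```

Because the pattern is anchored at both ends, earlier parentheses such as "(fu)" or "(23)" stay in the title. Strings that do not end in "(digits)" are left unchanged by sub, so they would need separate handling.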

Answer 3

A tidyr solution

library(tidyr)
df %>% separate(col, c("name", "year"), "[()]")

Thanks to Avinash, I can take his regular expression and apply it with tidyr:

m <- c("The Shawshank Redemption (1994)", "The Shawshank (Redemption) (1994)", "Kung(fu) Pa (23) nda (2010)")
m2 <- data.frame(m)
m2 %>% separate(m, c("name", "year"), "\\s*\\((?=\\d+\\)$)|\\)$")

                        name year
1   The Shawshank Redemption 1994
2 The Shawshank (Redemption) 1994
3       Kung(fu) Pa (23) nda 2010

Answer 4

Try the following code:

t(sapply(strsplit(c("The Shawshank Redemption (1994)"), '\\s*\\(|\\)'),rbind))

The code above will work if you pass in only the data frame column that contains the titles.
