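My data frame contains data from IMDb. One column holds the movie title followed by the year in parentheses, for example: The Shawshank Redemption (1994)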
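What I really want is to separate the title from the year. I tried a few different things (split, strsplit), but without success. I tried splitting at the first parenthesis, but both split functions seem to reject non-character arguments. Does anyone have any ideas?
Answer 0 (score: 7)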
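strsplit works on character columns. So if the column is of class factor, we need to convert it to class character (as.character(..)). Here, I split on either zero or more spaces (\\s*) followed by an opening parenthesis (\\(), or (|) a closing parenthesis (\\)):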
strsplit(as.character(d1$v1), '\\s*\\(|\\)')[[1]]
#[1] "The Shawshank Redemption" "1994"
Or we can place the parentheses inside [], so that we do not have to escape them with \\ (as commented by @Avinash Raj):
strsplit(as.character(d1$v1), '\\s*[()]')[[1]]
# Example data used above
v1 <- 'The Shawshank Redemption (1994)'
d1 <- data.frame(v1)
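To apply the same split to every row instead of only the first element, the list returned by strsplit can be bound into a matrix. This is a minimal sketch, assuming every title ends in exactly one parenthesized year and contains no other parentheses; the second title and the column names title and year are made up for illustration.

```r
# Illustrative data; the second title is an added example, not from the question.
d1 <- data.frame(v1 = c("The Shawshank Redemption (1994)",
                        "The Godfather (1972)"),
                 stringsAsFactors = TRUE)  # factor column, as in the question

# strsplit() needs character input, hence as.character().
parts <- strsplit(as.character(d1$v1), "\\s*[()]")

# Each list element holds two pieces (title, year); bind them row-wise.
m <- do.call(rbind, parts)
d1$title <- m[, 1]
d1$year  <- as.integer(m[, 2])
d1
```

If some titles contain extra parentheses, the list elements will differ in length and this approach is not appropriate; a pattern anchored on the trailing year is safer in that case.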
Answer 1 (score: 3)
If you want an exact split (that is, splitting only on the final pair of parentheses), you could try this.
x <- c("The Shawshank Redemption (1994)", "Kung(fu) Pa (23) nda (2010)")
strsplit(as.character(x), "\\s*\\((?=\\d+\\)$)|\\)$", perl=TRUE)
# [[1]]
# [1] "The Shawshank Redemption" "1994"
# [[2]]
# [1] "Kung(fu) Pa (23) nda" "2010"
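An alternative to splitting is to extract the two parts directly with sub(), anchoring on the trailing year. This is a sketch under the assumption that every string ends in a parenthesized run of digits; the variable names title and year are chosen here for illustration.

```r
x <- c("The Shawshank Redemption (1994)", "Kung(fu) Pa (23) nda (2010)")

# Remove the trailing " (digits)" to obtain the title.
title <- sub("\\s*\\(\\d+\\)$", "", x)

# Keep only the digits inside the final pair of parentheses.
year <- as.integer(sub("^.*\\((\\d+)\\)$", "\\1", x))

data.frame(title, year)
```

Strings that do not end in a parenthesized number are returned unchanged by sub(), so as.integer() would yield NA for them; checking for NA afterwards is one way to spot rows that did not match.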
Answer 2 (score: 2)
A tidyr solution:
df%>%separate(col,c("name", "year"), "[()]")
Thanks to Avinash, I can take his regular expression and use it with tidyr:
m<-c("The Shawshank Redemption (1994)","The Shawshank (Redemption) (1994)", "Kung(fu) Pa (23) nda (2010)")
m2<-data.frame(m)
m2%>%separate(m,c("name", "year"), "\\s*\\((?=\\d+\\)$)|\\)$")
                        name year
1   The Shawshank Redemption 1994
2 The Shawshank (Redemption) 1994
3       Kung(fu) Pa (23) nda 2010
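tidyr also provides extract(), which takes a regular expression with capture groups rather than a separator. The following is a sketch, not part of the original answer, assuming the tidyr package is installed and that each string ends in a parenthesized year; the pattern is one possible choice.

```r
m2 <- data.frame(m = c("The Shawshank Redemption (1994)",
                       "The Shawshank (Redemption) (1994)",
                       "Kung(fu) Pa (23) nda (2010)"))

# Group 1: everything before the final " (digits)"; group 2: the digits.
# convert = TRUE lets tidyr turn the year column into a numeric type.
res <- tidyr::extract(m2, m, into = c("name", "year"),
                      regex = "^(.*?)\\s*\\((\\d+)\\)$",
                      convert = TRUE)
res
```

Rows that do not match the pattern come back as NA in both new columns, which makes malformed titles easy to find.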
Answer 3 (score: 0)
Please try the following code:
t(sapply(strsplit(c("The Shawshank Redemption (1994)"), '\\s*\\(|\\)'),rbind))
The code above will work if you pass in only the data frame column that contains the titles.