I have a large dataset of text strings extracted from a corpus, in the following form:

Category Group Text_Strings
1        A     c(string1, string2, string3)
1        A     c(string1, string3)
1        B     character(0)
1        B     c(string1)
1        B     c(string3)
2        A     character(0)
2        A     character(0)
2        B     c(string1, string3)
3        A     c(string1, string2, string3)
3        A     character(0)
3        A     c(string1)
3        B     character(0)
3        B     c(string1, string2, string3)

...where A and B in Category 1 have string1 and string3 in common; Category 2 has nothing in common; and Category 3 has all three in common.

I would like to get the number of matching strings between Group A and Group B within each Category. The matching strings can be in any order; for example, c(string1, string2) compared against c(string2, string1) should count as two matches. In addition, matches should only be counted between the unique strings within each Category; for example, c(string1, string2), c(string1) compared against c(string2, string1) should still be only two matches. For example:

Category Group Text_Strings
4        A     c(string1, string2, string3)
4        A     c(string1)
4        B     c(string1)
4        B     c(string1)

...would yield only one match, even though string1 is repeated.

My final output should look like this:

Category Matches
1        2
2        0
3        3
4        1

I have done a lot of research but could not find the answer on my own. It seems to me that I could subset the data frame by Group, somehow aggregate/concatenate the strings over Categories, and then use lapply() and intersect()... something like

for(i in 1:nrow(data)[1]) {
  data$matches[i] <- sum(intersect(subset(data, Group=="A")$Text_Strings[i],
                                   subset(data, Group=="B")$Text_Strings[i]))
}

Of course this is missing steps and does not work, but am I on the right track? Thanks for your help!

Update: jeremycg's solution was very useful, but my data was so messy that parse() would not accept it. Thanks to another user in another thread, I got around this by splitting the rows on the comma delimiter rather than trying to strip out the problematic text directly. This produced the same unnested data, but much cleaner.
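A minimal sketch of what comma-based splitting could look like, since the update describes the approach without showing it. Everything here is an assumption for illustration: the data frame `messy`, its column names, and the choice of `tidyr::separate_rows()`; the thread does not say which function was actually used.

```r
library(tidyr)

# Hypothetical messy input: each cell holds comma-separated strings as plain text.
messy <- data.frame(
  Category     = c(1, 1, 2),
  Group        = c("A", "B", "A"),
  Text_Strings = c("string1, string2", "string2", "string3"),
  stringsAsFactors = FALSE
)

# Split each cell on commas (plus any following whitespace),
# giving one row per string without going through parse()/eval().
long <- separate_rows(messy, Text_Strings, sep = ",\\s*")
long
```

The result is already in long format, so it can go straight into `distinct()`, `group_by()` and `summarise()` with no `unnest()` step.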
Answer 0 (score: 2)
You can use dplyr and tidyr:
library(dplyr)
library(tidyr)
x %>% unnest(Text_Strings) %>% #spread out the nested column (recent tidyr versions expect the column to be named)
  distinct() %>% #remove dupes
  group_by(Category) %>% #by Category
  summarise(out = sum(Text_Strings[Group == 'A'] %in% Text_Strings[Group == 'B'])) #sum the overlap
which gives:
Source: local data frame [3 x 2]

  Category   out
     (int) (int)
1        1     2
2        2     0
3        3     3
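For anyone who wants to reproduce this, here is a sketch that rebuilds the first three Categories of the question's example as a data frame with a list-column and counts the overlap in base R with `intersect()`, the approach the question was reaching for. The object name `x` follows the answer; the construction of the data and the helper `count_matches` are assumptions made for illustration.

```r
# Example data from the question, rebuilt with a list-column of character vectors.
x <- data.frame(
  Category = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3),
  Group    = c("A", "A", "B", "B", "B", "A", "A", "B", "A", "A", "A", "B", "B"),
  stringsAsFactors = FALSE
)
x$Text_Strings <- list(
  c("string1", "string2", "string3"), c("string1", "string3"),
  character(0), "string1", "string3",
  character(0), character(0), c("string1", "string3"),
  c("string1", "string2", "string3"), character(0), "string1",
  character(0), c("string1", "string2", "string3")
)

# Hypothetical helper: pool each Group's strings within one Category
# and count the unique strings that appear in both Groups.
count_matches <- function(d) {
  a <- as.character(unlist(d$Text_Strings[d$Group == "A"]))
  b <- as.character(unlist(d$Text_Strings[d$Group == "B"]))
  length(intersect(a, b))  # intersect() already drops duplicates
}

matches <- sapply(split(x, x$Category), count_matches)
data.frame(Category = names(matches), Matches = unname(matches))
# Matches are 2, 0, 3 for Categories 1, 2, 3, the same counts as above
```

Because `intersect()` works on sets, repeated strings within a Group are counted once, which is the behaviour asked for in the Category 4 example.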
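Your actual data is very messy; you should try to fix whatever is producing it so that the output comes out in "long format". Here is a clunky solution: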
x$listcites = gsub('\\\\n', '', x$listcites) #remove literal "\n" sequences
x$listcites = gsub("\"", "'", x$listcites, fixed = TRUE) #replace double quotes with single quotes
bare = grepl('^[^c]', x$listcites) #entries not already wrapped in c(...)
x$listcites[bare] = paste("c('", x$listcites[bare], "')", sep = '') #wrap single entries in the same c('...') format
x$listcites = lapply(x$listcites, function(s) eval(parse(text = s))) #eval each string to a vector, kept as a list-column
x %>% unnest(listcites) %>% #spread out the nested column
  distinct() %>% #remove dupes
  group_by(case_num) %>% #by case
  summarise(out = sum(listcites[type == 'claimant'] %in% listcites[type == 'court'])) #sum the overlap