Comparing string matches between groups, by category

Time: 2017-01-10 23:50:23

Tags: r

I have a large dataset of text strings extracted from a corpus, in the following form:

Category      Group         Text_Strings 
1             A             c(string1, string2, string3)
1             A             c(string1, string3)
1             B             character(0)
1             B             c(string1)
1             B             c(string3)

2             A             character(0)
2             A             character(0)
2             B             c(string1, string3)

3             A             c(string1, string2, string3)
3             A             character(0)
3             A             c(string1)
3             B             character(0)
3             B             c(string1, string2, string3)

...where A and B have string1 and string3 in common in Category 1, nothing in common in Category 2, and all three strings in common in Category 3.

I want to get the number of strings that match between groups A and B within each category. Matching strings can appear in any order; for example, c(string1, string2) evaluated against c(string2, string1) should count as two matches. Also, matches should only be counted between the unique strings within each category; for example, c(string1, string2), c(string1) versus c(string2, string1) should still be just two matches. For example:

Category      Group         Text_Strings 
4             A             c(string1, string2, string3)
4             A             c(string1)
4             B             c(string1)
4             B             c(string1)

...would produce only one match, even though string1 is repeated.

My final output should look like this:

Category     Matches
1            2
2            0 
3            3
4            1

I have done a lot of research but could not find the answer on my own. It seems to me that I could subset the data frame by Group, somehow aggregate/concatenate the strings over Categories, and then use lapply() and intersect()... something like

for(i in 1:nrow(data)[1]) {
    data$matches[i] <- sum(intersect(subset(data, Group=="A")$Text_Strings[i], 
                                     subset(data, Group=="B")$Text_Strings[i])) 
}

Of course this is missing steps and does not work, but am I on the right track? Thanks for your help!

UPDATE: jeremycg's solution was very helpful, but my data was so messy that parse() would not accept it. Thanks to another user in another thread, I got around this by splitting the rows on the comma delimiter rather than trying to remove the stray characters directly.

This produced the same unnested data, but much cleaner.
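The splitting code itself is not shown in the update. As a rough sketch of the idea only, assuming a hypothetical data frame `raw` whose `Text_Strings` column holds plain comma-separated text (the object name and sample values are made up for illustration), base R's strsplit() can produce one row per string:

```r
# Hypothetical sample data: one comma-separated string per row.
# This is an assumption for illustration, not the asker's real data.
raw <- data.frame(
  Category = c(1, 1, 1),
  Group = c("A", "B", "B"),
  Text_Strings = c("string1, string2, string3", "string1", ""),
  stringsAsFactors = FALSE
)

# Split each entry on commas and trim surrounding whitespace.
# An empty entry yields character(0), so it contributes no rows.
pieces <- lapply(strsplit(raw$Text_Strings, ",", fixed = TRUE), trimws)

# Repeat each row's Category/Group once per extracted string (long format).
long <- data.frame(
  Category = rep(raw$Category, lengths(pieces)),
  Group = rep(raw$Group, lengths(pieces)),
  Text_Strings = unlist(pieces),
  stringsAsFactors = FALSE
)
long
```

The resulting `long` data frame has one string per row, which is the shape that distinct() and group_by() expect.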

1 Answer:

Answer 0: (score: 2)

You can use dplyr and tidyr:

library(dplyr)
library(tidyr)
x %>% unnest() %>% #spread out the nested columns
      distinct() %>% #remove dupes
      group_by(Category) %>% #by Category
      summarise(out = sum(Text_Strings[Group  == 'A'] %in% Text_Strings[Group  == 'B'])) #sum the overlap

which gives:

Source: local data frame [3 x 2]

  Category   out
     (int) (int)
1        1     2
2        2     0
3        3     3
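For comparison, the same count can be sketched in base R with unique() and intersect(), which is closer to what the question was reaching for. This is only an illustrative sketch: the data frame is rebuilt by hand from the question's table for Categories 1 to 3, and `count_matches` is a hypothetical helper name.

```r
# Sample data rebuilt from the question's table (Categories 1 to 3).
x <- data.frame(
  Category = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3),
  Group = c("A", "A", "B", "B", "B", "A", "A", "B", "A", "A", "A", "B", "B"),
  stringsAsFactors = FALSE
)
x$Text_Strings <- list(
  c("string1", "string2", "string3"), c("string1", "string3"),
  character(0), "string1", "string3",
  character(0), character(0), c("string1", "string3"),
  c("string1", "string2", "string3"), character(0), "string1",
  character(0), c("string1", "string2", "string3")
)

# Hypothetical helper: for one category, pool the unique strings of each
# group, then count how many appear in both.
count_matches <- function(d) {
  a <- unique(unlist(d$Text_Strings[d$Group == "A"]))
  b <- unique(unlist(d$Text_Strings[d$Group == "B"]))
  length(intersect(a, b))
}

# Apply the helper to each category's rows.
matches <- sapply(split(x, x$Category), count_matches)
data.frame(Category = names(matches), Matches = unname(matches))
```

On this sample the counts are 2, 0 and 3 for Categories 1, 2 and 3, the same as the dplyr output.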

Your actual data is very messy; you should try to fix whatever is producing it so that it outputs "long format". Here is a clunky solution:

x$listcites = gsub('\\\\n', '', x$listcites) #remove literal "\n" sequences
x$listcites = gsub("\"", "'", x$listcites, fixed = TRUE) #convert double quotes to single quotes
bare = grepl('^[^c]', x$listcites) #rows that are not already in c(...) form
x$listcites[bare] = paste("c('", x$listcites[bare], "')", sep = '') #wrap single entries in the same c('...') format
x$listcites = lapply(x$listcites, function(s) eval(parse(text = s))) #evaluate each string into a character vector (list column)
x %>% unnest() %>% #spread out the nested column
      distinct() %>% #remove dupes
      group_by(case_num) %>% #by case
      summarise(out = sum(listcites[type  == 'claimant'] %in% listcites[type  == 'court'])) #sum the overlap