Question

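I have two data frames that look like the ones below (the 'Content' column in df1 actually holds the full text of an article, not just a single sentence as in my example):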

    PDF     Content
1   1234    This article is about bananas and pears and grapes, but also mentions apples and oranges, so much fun!
2   1111    Johannes writes about apples and oranges and that's great.
3   8000    Content that cannot be matched to the anything in df1.    
4   3993    There is an interesting piece on bananas plus kiwis as well.
    ...

(Total: 5709 entries)

    Author        Title
1   Johannes      Apples and oranges
2   Peter         Bananas and pears and grapes
3   Hannah        Bananas plus kiwis
4   Helena        Mangos and peaches
    ...

(Total: 10228 entries)

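I want to merge the two data frames by searching for each 'Title' from df2 in the 'Content' of df1. A title counts as a match if it appears anywhere within the first 2500 characters of the content. Note: it is very important to keep all entries of df1, whereas from df2 I only want to keep the entries that match (i.e. a left join). Note: all titles are unique.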

Desired output (column order does not matter):

    Author     Title                        PDF     Content
1   Peter      Bananas and pears and grapes 1234    This article is about bananas and pears and grapes, but also mentions apples and oranges, so much fun!
2   Johannes   Apples and oranges           1111    Johannes writes about apples and oranges and that's great.
3   NaN        NaN                          8000    Content that cannot be matched to the anything in df2.    
4   Hannah     Bananas plus kiwis           3993    There is an interesting piece on bananas plus kiwis as well.
    ...

I think I need some combination of pd.merge and str.contains, but I can't figure out how!

Answer 1

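Warning: this solution may be slow :)

1. Get the list of titles.
2. Create an index for df1 based on the order of the title list.
3. Combine df1 and df2 on idx.

Output: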

      PDF                                            Content    Author  \
0  1111.0  Johannes writes about apples and oranges and t...  Johannes
1  1234.0  This article is about bananas and pears and gr...     Peter
2  3993.0  There is an interesting piece on bananas plus ...    Hannah
3     NaN                                                NaN    Helena
4  8000.0  Content that cannot be matched to the anything...       NaN

                          Title
0            Apples and oranges
1  Bananas and pears and grapes
2            Bananas plus kiwis
3            Mangos and peaches
4                           NaN
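
The three steps above can be sketched roughly as follows. This is one possible reading of them and not necessarily the answerer's original code. The helper name `earliest_title_idx`, the `-1` sentinel for "no match", and the tie-break rule (when several titles occur, keep the one that appears earliest in the text) are assumptions made for illustration. The sketch also uses a left merge, as requested in the question, so unlike the output shown above it does not keep the unmatched df2 row (Helena).

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "PDF": [1234, 1111, 8000, 3993],
    "Content": [
        "This article is about bananas and pears and grapes, but also "
        "mentions apples and oranges, so much fun!",
        "Johannes writes about apples and oranges and that's great.",
        "Content that cannot be matched to the anything in df1.",
        "There is an interesting piece on bananas plus kiwis as well.",
    ],
})
df2 = pd.DataFrame({
    "Author": ["Johannes", "Peter", "Hannah", "Helena"],
    "Title": ["Apples and oranges", "Bananas and pears and grapes",
              "Bananas plus kiwis", "Mangos and peaches"],
})

# Step 1: list of titles, lowercased for case-insensitive matching.
titles = df2["Title"].str.lower().tolist()

def earliest_title_idx(content, limit=2500):
    """Position in `titles` of the title occurring earliest in the first
    `limit` characters of `content`, or -1 if no title occurs (assumed rule)."""
    head = content[:limit].lower()
    hits = [(head.find(t), i) for i, t in enumerate(titles) if t in head]
    return min(hits)[1] if hits else -1

# Step 2: give every df1 row the index of its matching title.
left = df1.assign(idx=df1["Content"].map(earliest_title_idx))

# Step 3: expose df2's row position as an 'idx' column and left-merge on it.
right = df2.rename_axis("idx").reset_index()
merged = left.merge(right, on="idx", how="left").drop(columns="idx")
print(merged[["PDF", "Author", "Title"]])
```

Because every df1 row is compared against every title in a Python loop, this is quadratic in the number of rows, which is consistent with the warning that the approach may be slow.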


Answer 2

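You can do a full Cartesian join (cross product) and then filter. Since you cannot do a hash lookup here anyway, it should not be any slower than an equivalent "Join" statement: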

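import pandas as pd

# Give both frames the same constant key so the merge yields the full cross product.
df1['key'] = 1
df2['key'] = 1
df3 = pd.merge(df1, df2, on='key')
# Reuse 'key' as a boolean mask: True where the title occurs in the first 2500 characters.
df3['key'] = df3.apply(lambda row: row['Title'].lower() in row['Content'][:2500].lower(), axis=1)
df3 = df3.loc[df3['key'], ['PDF', 'Author', 'Title', 'Content']]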

This produces the table:

       PDF    Author                         Title  \
0   1234.0  Johannes            Apples and oranges
1   1234.0     Peter  Bananas and pears and grapes
4   1111.0  Johannes            Apples and oranges
14  3993.0    Hannah            Bananas plus kiwis

                                              Content
0   This article is about bananas and pears and gr...
1   This article is about bananas and pears and gr...
4   Johannes writes about apples and oranges and t...
14  There is an interesting piece on bananas plus ...
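
The table above contains only the matched pairs, so the unmatched row (PDF 8000) is missing and PDF 1234 appears twice. A minimal sketch of how the matches could be turned back into the left join the question asks for is given below. It assumes that 'PDF' uniquely identifies a row of df1 (the question does not state this) and uses `how="cross"`, which needs pandas 1.2 or later, in place of the constant key.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "PDF": [1234, 1111, 8000, 3993],
    "Content": [
        "This article is about bananas and pears and grapes, but also "
        "mentions apples and oranges, so much fun!",
        "Johannes writes about apples and oranges and that's great.",
        "Content that cannot be matched to the anything in df1.",
        "There is an interesting piece on bananas plus kiwis as well.",
    ],
})
df2 = pd.DataFrame({
    "Author": ["Johannes", "Peter", "Hannah", "Helena"],
    "Title": ["Apples and oranges", "Bananas and pears and grapes",
              "Bananas plus kiwis", "Mangos and peaches"],
})

# Full cross product of the two frames (every content paired with every title).
pairs = df1.merge(df2, how="cross")

# Keep the pairs whose title occurs in the first 2500 characters of the content.
mask = [t.lower() in c[:2500].lower()
        for t, c in zip(pairs["Title"], pairs["Content"])]
matches = pairs.loc[mask, ["PDF", "Author", "Title"]]

# Left-merge the matches back onto df1 so unmatched articles are kept with NaN.
# Assumption: 'PDF' is unique per df1 row.
result = df1.merge(matches, on="PDF", how="left")
print(result[["PDF", "Author", "Title"]])
```

An article that contains several titles still yields one row per matching title here; picking a single title per article would need an extra rule that the question does not specify.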

Python: combining str.contains and merge in pandas

2 answers: