I have two data frames that look like the ones below (the 'Content' column in df1 actually holds the full text of an article, not just a single sentence as in my example):
   PDF   Content
1  1234  This article is about bananas and pears and grapes, but also mentions apples and oranges, so much fun!
2  1111  Johannes writes about apples and oranges and that's great.
3  8000  Content that cannot be matched to the anything in df1.
4  3993  There is an interesting piece on bananas plus kiwis as well.
...
(Total: 5709 entries)
   Author    Title
1  Johannes  Apples and oranges
2  Peter     Bananas and pears and grapes
3  Hannah    Bananas plus kiwis
4  Helena    Mangos and peaches
...
(Total: 10228 entries)
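I want to merge the two data frames by searching the 'Content' of df1 for the 'Title' values from df2. A title counts as a match if it appears anywhere in the first 2500 characters of the content. Note: it is important to keep all entries of df1. From df2, by contrast, I only want to keep the entries that match (i.e. a left join). Note: all titles are unique values.
Desired output (column order does not matter):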
   Author    Title                         PDF   Content
1  Peter     Bananas and pears and grapes  1234  This article is about bananas and pears and grapes, but also mentions apples and oranges, so much fun!
2  Johannes  Apples and oranges            1111  Johannes writes about apples and oranges and that's great.
3  NaN       NaN                           8000  Content that cannot be matched to the anything in df2.
4  Hannah    Bananas plus kiwis            3993  There is an interesting piece on bananas plus kiwis as well.
...
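I think I need some combination of pd.merge and str.contains, but I can't figure out how!
Answer 0 (score: 0)
Warning: this solution may be slow :)
1. Get the list of titles
2. Create an index for df1 based on the order of the title list
3. Join df1 and df2 on idx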
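The code for these three steps is not shown here, so the following is only a minimal sketch of how they could look, not the answerer's actual code. The helper name `match_title_index`, the `-1` sentinel for "no match", and the rule of picking the title that occurs earliest in the text when several titles match are all assumptions made for illustration. The sample frames are the four example rows from the question.

```python
import pandas as pd

# Sample data copied from the question (the real frames are much larger).
df1 = pd.DataFrame({
    "PDF": [1234, 1111, 8000, 3993],
    "Content": [
        "This article is about bananas and pears and grapes, but also "
        "mentions apples and oranges, so much fun!",
        "Johannes writes about apples and oranges and that's great.",
        "Content that cannot be matched to the anything in df1.",
        "There is an interesting piece on bananas plus kiwis as well.",
    ],
})
df2 = pd.DataFrame({
    "Author": ["Johannes", "Peter", "Hannah", "Helena"],
    "Title": ["Apples and oranges", "Bananas and pears and grapes",
              "Bananas plus kiwis", "Mangos and peaches"],
})

# Step 1: list of titles, lower-cased once for case-insensitive matching.
titles = [t.lower() for t in df2["Title"]]


def match_title_index(content, limit=2500):
    """Hypothetical helper: position in `titles` of the title that occurs
    earliest within the first `limit` characters, or -1 if none occurs."""
    head = content[:limit].lower()
    best_pos, best_idx = None, -1
    for i, title in enumerate(titles):
        pos = head.find(title)
        if pos != -1 and (best_pos is None or pos < best_pos):
            best_pos, best_idx = pos, i
    return best_idx


# Step 2: for every article, store the row position of its matching title.
df1["idx"] = df1["Content"].apply(match_title_index)

# Step 3: expose df2's row positions as an "idx" column and left-join on it,
# so every df1 row is kept and unmatched rows get NaN for Author/Title.
df2_keyed = df2.rename_axis("idx").reset_index()
result = df1.merge(df2_keyed, on="idx", how="left").drop(columns="idx")
print(result[["Author", "Title", "PDF"]])
```

With `how="left"` every row of df1 survives, which is what the question asks for. Switching to `how="outer"` would also keep unmatched df2 rows such as Helena's. The nested loop checks every article against every title, so the "may be slow" warning applies to this sketch as well.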
Output:
      PDF                                            Content    Author  \
0  1111.0  Johannes writes about apples and oranges and t...  Johannes
1  1234.0  This article is about bananas and pears and gr...     Peter
2  3993.0  There is an interesting piece on bananas plus ...    Hannah
3     NaN                                                NaN    Helena
4  8000.0  Content that cannot be matched to the anything...       NaN

                          Title
0            Apples and oranges
1  Bananas and pears and grapes
2            Bananas plus kiwis
3            Mangos and peaches
4                           NaN
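Answer 1 (score: 0)
You can do a full Cartesian join (cross product) and then filter. Since a hash lookup is not possible here, it should not be any slower than an equivalent "join" statement: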
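import pandas as pd
# Both frames need the same constant key so that the merge pairs every df1 row with every df2 row
df1['key'] = 1
df2['key'] = 1
df3 = pd.merge(df1, df2, on='key')
# True where the title occurs (case-insensitively) in the first 2500 characters of the content
df3['key'] = df3.apply(lambda row: row['Title'].lower() in row['Content'][:2500].lower(), axis=1)
df3 = df3.loc[df3['key'], ['PDF', 'Author', 'Title', 'Content']]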
This produces the table:
PDF Author Title \
0 1234.0 Johannes Apples and oranges
1 1234.0 Peter Bananas and pears and grapes
4 1111.0 Johannes Apples and oranges
14 3993.0 Hannah Bananas plus kiwis
Content
0 This article is about bananas and pears and gr...
1 This article is about bananas and pears and gr...
4 Johannes writes about apples and oranges and t...
14 There is an interesting piece on bananas plus ...
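The table above contains only matched pairs: PDF 8000 is missing and PDF 1234 appears twice because two titles occur in its text. To get the left join the question asks for, the matches can be reduced to one per article and merged back onto df1. The following is a sketch under stated assumptions rather than part of the answer: it assumes 'PDF' is unique in df1, it keeps the title that occurs earliest in the text when several match (my own tie-breaking rule), and it uses `how="cross"`, which needs pandas 1.2 or later. The `pos` column is a hypothetical helper.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "PDF": [1234, 1111, 8000, 3993],
    "Content": [
        "This article is about bananas and pears and grapes, but also "
        "mentions apples and oranges, so much fun!",
        "Johannes writes about apples and oranges and that's great.",
        "Content that cannot be matched to the anything in df1.",
        "There is an interesting piece on bananas plus kiwis as well.",
    ],
})
df2 = pd.DataFrame({
    "Author": ["Johannes", "Peter", "Hannah", "Helena"],
    "Title": ["Apples and oranges", "Bananas and pears and grapes",
              "Bananas plus kiwis", "Mangos and peaches"],
})

# Cross product of all articles and all titles (pandas >= 1.2).
pairs = df1.merge(df2, how="cross")

# Position of the title within the first 2500 characters, -1 if absent.
pairs["pos"] = [
    content[:2500].lower().find(title.lower())
    for content, title in zip(pairs["Content"], pairs["Title"])
]
matches = pairs[pairs["pos"] >= 0]

# Assumption: when several titles match one article, keep the earliest one.
best = matches.sort_values("pos").drop_duplicates("PDF")

# Left join back onto df1 so that unmatched articles are kept with NaN.
result = df1.merge(best[["PDF", "Author", "Title"]], on="PDF", how="left")
print(result[["Author", "Title", "PDF"]])
```

At the sizes given in the question (5709 × 10228 rows) the cross product has roughly 58 million pairs, so processing df1 in chunks may be necessary if memory is tight.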