Merging two DataFrames by text similarity using pandas

Time: 2017-05-04 09:56:11

Tags: python postgresql pandas text

I run a query like this:

select * 
from sd_sms LEFT JOIN categories_phrases 
    on sd_sms.body like  concat('%',categories_phrases.phrase1,'%')
    and sd_sms.body like concat('%',categories_phrases.phrase2,'%')
    and sd_sms.body like concat('%',categories_phrases.phrase3,'%')
    and sd_sms.body like concat('%',categories_phrases.phrase4,'%')

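Basically, it joins the two tables whenever a field in table A contains several phrases from table B. But now I need to do this in Python.

Is there a simple way to merge the two tables with pandas so that it gives me the same result?

Please advise.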

2 answers:

Answer 0: (score: 1)

This code sample works with text data and a LIKE condition in the join clause.

from pandasql import sqldf
import pandas as pd

# Helper that runs a SQL query against the DataFrames defined in this module
pysqldf = lambda q: sqldf(q, globals())

df1 = pd.DataFrame({"name": ['Antony', 'Mark', 'Jacob'], "age": [11, 12, 13]})
df2 = pd.DataFrame({"name": ['Antony', 'Gill', 'John']})

# LEFT JOIN on a LIKE condition: df1.name must contain df2.name
q = """SELECT * FROM df1 LEFT JOIN df2 ON df1.name LIKE '%' || df2.name || '%'"""

df = pysqldf(q)

These are just dummy DataFrames with sample data, but I applied a LIKE condition similar to the one in your question.

Hope it helps.
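
If you would rather stay in plain pandas, the same join can be sketched as a cross join followed by a substring filter. This is only a minimal sketch under a few assumptions: the sample rows, the `category` column and the unique `sms_id` key are made up for illustration, the phrase columns contain no nulls, and the phrases are treated as literal substrings (so `%` or `_` inside a phrase is not a wildcard, unlike in LIKE).

```python
import pandas as pd

# Hypothetical sample data; only body and phrase1..phrase4 come from the question.
sd_sms = pd.DataFrame({
    "sms_id": [1, 2, 3],  # assumed unique key, used to join the matches back
    "body": [
        "your card payment of 10 usd was approved",
        "your card payment was declined",
        "hello from a friend",
    ],
})
categories_phrases = pd.DataFrame({
    "category": ["approved_payment", "declined_payment"],
    "phrase1": ["card", "card"],
    "phrase2": ["payment", "payment"],
    "phrase3": ["approved", "declined"],
    "phrase4": ["usd", "was"],
})
phrase_cols = ["phrase1", "phrase2", "phrase3", "phrase4"]

# 1. Pair every message with every phrase set (cartesian product).
#    A constant helper key works on old pandas; pandas 1.2+ also has how="cross".
pairs = sd_sms.assign(_key=1).merge(categories_phrases.assign(_key=1), on="_key")

# 2. Keep only the pairs whose body contains all four phrases (case-sensitive).
mask = pairs.apply(
    lambda row: all(row[col] in row["body"] for col in phrase_cols), axis=1
)
matches = pairs[mask].drop(columns=["body", "_key"])

# 3. Left-join the matches back so that messages without a match are kept.
result = sd_sms.merge(matches, on="sms_id", how="left")
print(result[["sms_id", "category"]])
```

Keep in mind that the cross join builds len(sd_sms) * len(categories_phrases) rows in memory, so this only suits tables of moderate size.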

Answer 1: (score: 0)

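I can't tell what your data looks like, because the question doesn't include any sample data; but if you need to query pandas DataFrames with SQL-like syntax, you can try the pandasql package. It relies on SQLAlchemy under the hood.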

from pandasql import sqldf, load_meat, load_births
import pandas as pd

pysqldf = lambda q: sqldf(q, globals())

q  = """
  SELECT
  m.date
  , m.beef
  , b.births
  FROM
  meat m
  LEFT JOIN
   births b
   ON m.date = b.date
  WHERE
  m.date > '1974-12-31';
  """

meat = load_meat()
births = load_births()

df = pysqldf(q)
df

date    beef    births
0   1975-01-01 00:00:00.000000  2106.0  265775.0
1   1975-02-01 00:00:00.000000  1845.0  241045.0
2   1975-03-01 00:00:00.000000  1891.0  268849.0
3   1975-04-01 00:00:00.000000  1895.0  247455.0
4   1975-05-01 00:00:00.000000  1849.0  254545.0
5   1975-06-01 00:00:00.000000  1849.0  254096.0
6   1975-07-01 00:00:00.000000  1916.0  275163.0
7   1975-08-01 00:00:00.000000  1961.0  281300.0
8   1975-09-01 00:00:00.000000  2065.0  270738.0
9   1975-10-01 00:00:00.000000  2270.0  265494.0
10  1975-11-01 00:00:00.000000  1970.0  251973.0
11  1975-12-01 00:00:00.000000  2055.0  260532.0
12  1976-01-01 00:00:00.000000  2208.0  257455.0
13  1976-01-01 00:00:00.000000  2208.0  259173.0
14  1976-02-01 00:00:00.000000  1966.0  236551.0
15  1976-02-01 00:00:00.000000  1966.0  238153.0
16  1976-03-01 00:00:00.000000  2318.0  257951.0
17  1976-03-01 00:00:00.000000  2318.0  261608.0
18  1976-04-01 00:00:00.000000  2015.0  246469.0
19  1976-04-01 00:00:00.000000  2015.0  250992.0
20  1976-05-01 00:00:00.000000  1969.0  256986.0
21  1976-05-01 00:00:00.000000  1969.0  261572.0
22  1976-06-01 00:00:00.000000  2161.0  250525.0
23  1976-06-01 00:00:00.000000  2161.0  255734.0
24  1976-07-01 00:00:00.000000  2111.0  279630.0
25  1976-07-01 00:00:00.000000  2111.0  279744.0
26  1976-08-01 00:00:00.000000  2233.0  279937.0
27  1976-08-01 00:00:00.000000  2233.0  286496.0
28  1976-09-01 00:00:00.000000  2274.0  273750.0
29  1976-09-01 00:00:00.000000  2274.0  283718.0
... ... ... ...
533 2010-06-01 00:00:00.000000  2320.0  NaN
534 2010-07-01 00:00:00.000000  2229.6  NaN
535 2010-08-01 00:00:00.000000  2286.6  NaN
536 2010-09-01 00:00:00.000000  2252.2  NaN
537 2010-10-01 00:00:00.000000  2234.9  NaN
538 2010-11-01 00:00:00.000000  2235.5  NaN
539 2010-12-01 00:00:00.000000  2270.9  NaN
540 2011-01-01 00:00:00.000000  2122.9  356457.0
541 2011-02-01 00:00:00.000000  2020.4  338521.0
542 2011-03-01 00:00:00.000000  2266.2  350630.0
543 2011-04-01 00:00:00.000000  2052.5  346397.0
544 2011-05-01 00:00:00.000000  2131.9  354886.0
545 2011-06-01 00:00:00.000000  2375.0  348587.0
546 2011-07-01 00:00:00.000000  2134.1  375384.0
547 2011-08-01 00:00:00.000000  2386.9  373333.0
548 2011-09-01 00:00:00.000000  2215.2  367965.0
549 2011-10-01 00:00:00.000000  2215.1  357875.0
550 2011-11-01 00:00:00.000000  2148.8  323788.0
551 2011-12-01 00:00:00.000000  2126.3  353871.0
552 2012-01-01 00:00:00.000000  2113.8  337980.0
553 2012-02-01 00:00:00.000000  2009.0  316641.0
554 2012-03-01 00:00:00.000000  2159.8  347803.0
555 2012-04-01 00:00:00.000000  1990.6  337272.0
556 2012-05-01 00:00:00.000000  2232.0  345257.0
557 2012-06-01 00:00:00.000000  2252.1  346971.0
558 2012-07-01 00:00:00.000000  2200.8  368450.0
559 2012-08-01 00:00:00.000000  2367.5  359554.0
560 2012-09-01 00:00:00.000000  2016.0  361922.0
561 2012-10-01 00:00:00.000000  2343.7  347625.0
562 2012-11-01 00:00:00.000000  2206.6  320195.0

Here is the repo: https://github.com/yhat/pandasql and a good quick-start tutorial: http://blog.yhat.com/posts/pandasql-sql-for-pandas-dataframes.html
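
If installing pandasql is not an option, a similar result can probably be had with the standard-library sqlite3 module together with pandas' own `to_sql` and `read_sql_query`. The sketch below is an assumption-laden illustration, not a drop-in solution: the sample rows, the `sms_id` and `category` columns are invented, and the query is the one from the question rewritten with SQLite's `||` operator instead of `concat`. Note that SQLite's LIKE is case-insensitive for ASCII characters by default, whereas PostgreSQL's LIKE is case-sensitive.

```python
import sqlite3

import pandas as pd

# Hypothetical sample data; only body and phrase1..phrase4 come from the question.
sd_sms = pd.DataFrame({
    "sms_id": [1, 2, 3],
    "body": [
        "your card payment of 10 usd was approved",
        "your card payment was declined",
        "hello from a friend",
    ],
})
categories_phrases = pd.DataFrame({
    "category": ["approved_payment", "declined_payment"],
    "phrase1": ["card", "card"],
    "phrase2": ["payment", "payment"],
    "phrase3": ["approved", "declined"],
    "phrase4": ["usd", "was"],
})

# Copy both DataFrames into a throwaway in-memory SQLite database.
conn = sqlite3.connect(":memory:")
sd_sms.to_sql("sd_sms", conn, index=False)
categories_phrases.to_sql("categories_phrases", conn, index=False)

# The query from the question, using SQLite string concatenation.
query = """
SELECT *
FROM sd_sms
LEFT JOIN categories_phrases
    ON  sd_sms.body LIKE '%' || categories_phrases.phrase1 || '%'
    AND sd_sms.body LIKE '%' || categories_phrases.phrase2 || '%'
    AND sd_sms.body LIKE '%' || categories_phrases.phrase3 || '%'
    AND sd_sms.body LIKE '%' || categories_phrases.phrase4 || '%'
ORDER BY sd_sms.sms_id
"""
joined = pd.read_sql_query(query, conn)
conn.close()

print(joined[["sms_id", "category"]])
```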