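Python Pandas - Compare column text and report the number of matching words

Time: 2016-05-25 08:53:59

Tags: python pandas data-analysis text-analysis

I am trying to develop a string comparison tool. I have two sets of JSON data, as follows.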

DF 1:

ID  Subject
1   Angular JS : getting unexpected cross symbol with Image
2   Cordova debug: the specified file was not found
3   get custom mask for phone numbers
4   Remove files for the Xcode Bots Unit Test Coverage
5   "Upload to Mongodb collection in aldeed:autoform
6   Mask for phone numbers

DF 2:

ID  Subject
1   Please provide custom mask for phone numbers
2   Files for the Xcode Bots Unit Test Coverage need to be removed
3   Upload to Mongodb collection

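Now, using Python + pandas, for each ID in table 2 I want to find a closely matching entry in table 1. Word order does not matter, and special characters need to be removed from the comparison.

For example: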

For ID 1 - ID 2 has 5 matching words
For ID 1 - ID 6 has 4 matching words
For ID 2 - ID 4 has 8 matching words
For ID 3 - ID 4 has 4 matching words

Any pointers?

1 answer:

Answer 0 (score: 1)

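I think you can reuse the regex from the previous solution, then merge the two word lists on the word, groupby ID1 and ID2, and aggregate with size. The result counts the words shared by each pair of IDs.

The code strips characters with r'[^a-zA-Z\s]'. Another possible solution for that step is to list the special characters explicitly:

.replace(r'[\-\!\@\#\$\%\^\&\*\(\)\_\+\[\]\;\'\.\,\/\{\}\:\"\<\>\?\|]','')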
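The two patterns do not behave identically, which may matter when choosing between them. The following is a small sketch using the standard `re` module rather than pandas; the sample string is hypothetical (the question's row 5 with "(v2)" appended for illustration) and the constant names are mine.

```python
import re

# Pattern used in the main code: drop everything except ASCII letters and whitespace.
KEEP_LETTERS = r'[^a-zA-Z\s]'
# Explicit list of special characters, copied from the alternative shown above.
SPECIAL_CHARS = r'[\-\!\@\#\$\%\^\&\*\(\)\_\+\[\]\;\'\.\,\/\{\}\:\"\<\>\?\|]'

# Hypothetical sample: row 5 of DF 1 with "(v2)" added to show the digit handling.
sample = '"Upload to Mongodb collection in aldeed:autoform (v2)'

letters_only = re.sub(KEEP_LETTERS, '', sample)
specials_removed = re.sub(SPECIAL_CHARS, '', sample)

print(letters_only)      # digits are removed as well
print(specials_removed)  # digits survive, only the listed symbols go
```

Both patterns delete the character instead of replacing it with a space, so "aldeed:autoform" becomes the single word "aldeedautoform" either way.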
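import pandas as pd

# Index both frames by ID so the ID survives the word split
df1 = df1.set_index('ID')
df2 = df2.set_index('ID')

# One row per (ID, word): strip non-letters, lowercase, split on whitespace.
# regex=True is required on recent pandas, where str.replace is literal by default.
df3 = (df1['Subject'].str
                     .replace(r'[^a-zA-Z\s]', '', regex=True)
                     .str
                     .lower()
                     .str
                     .split(r'\s+', expand=True)
                     .stack()
                     .reset_index(drop=True, level=1)
                     .reset_index(name='val'))

df4 = (df2['Subject'].str
                     .replace(r'[^a-zA-Z\s]', '', regex=True)
                     .str
                     .lower()
                     .str
                     .split(r'\s+', expand=True)
                     .stack()
                     .reset_index(drop=True, level=1)
                     .reset_index(name='val'))

# Inner join on the word: one row per word shared by an (ID1, ID2) pair
df5 = pd.merge(df3, df4, on='val', suffixes=('1','2'))
print(df5)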
    ID1         val  ID2
0     2         the    2
1     4         the    2
2     3      custom    1
3     3        mask    1
4     6        mask    1
5     3         for    1
6     3         for    2
7     4         for    1
8     4         for    2
9     6         for    1
10    6         for    2
11    3       phone    1
12    6       phone    1
13    3     numbers    1
14    6     numbers    1
15    4       files    2
16    4       xcode    2
17    4        bots    2
18    4        unit    2
19    4        test    2
20    4    coverage    2
21    5      upload    3
22    5          to    2
23    5          to    3
24    5     mongodb    3
25    5  collection    3
print (df5.groupby(['ID1','ID2']).size().reset_index(name='c'))
   ID1  ID2  c
0    2    2  1
1    3    1  5
2    3    2  1
3    4    1  1
4    4    2  8
5    5    2  1
6    5    3  4
7    6    1  4
8    6    2  1
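
The counts above still leave the last step of the question open: picking, for each ID in DF 2, the closest entry in DF 1. A minimal self-contained sketch of that step follows. It is one possible continuation, not part of the original answer: the helper `to_words` is a hypothetical name, and using `explode` plus `drop_duplicates` (so a word repeated within one subject is counted once) is my assumption about the desired behaviour.

```python
import pandas as pd

# Data copied from the question's two tables.
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6],
    'Subject': [
        'Angular JS : getting unexpected cross symbol with Image',
        'Cordova debug: the specified file was not found',
        'get custom mask for phone numbers',
        'Remove files for the Xcode Bots Unit Test Coverage',
        '"Upload to Mongodb collection in aldeed:autoform',
        'Mask for phone numbers',
    ],
})
df2 = pd.DataFrame({
    'ID': [1, 2, 3],
    'Subject': [
        'Please provide custom mask for phone numbers',
        'Files for the Xcode Bots Unit Test Coverage need to be removed',
        'Upload to Mongodb collection',
    ],
})


def to_words(df):
    """Return one row per (ID, distinct word). Hypothetical helper."""
    words = (df.set_index('ID')['Subject']
               .str.replace(r'[^a-zA-Z\s]', '', regex=True)  # drop non-letters
               .str.lower()
               .str.split()      # split on any run of whitespace
               .explode()        # one word per row, ID kept in the index
               .rename('val')
               .reset_index())
    return words.drop_duplicates()


# Shared-word count for every (ID2, ID1) pair that has at least one word in common.
counts = (to_words(df1).merge(to_words(df2), on='val', suffixes=('1', '2'))
          .groupby(['ID2', 'ID1']).size().reset_index(name='c'))

# For each ID2 keep the row with the highest count.
best = (counts.sort_values(['ID2', 'c'], ascending=[True, False])
              .drop_duplicates('ID2')
              .reset_index(drop=True))
print(best)
```

On the question's data this selects DF 1 IDs 3, 4 and 5 for DF 2 IDs 1, 2 and 3, with 5, 8 and 4 shared words respectively. Ties are not handled specially here; `drop_duplicates` simply keeps whichever tied row sorts first.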