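I want to record the positions where two strings in a list differ (and remove those parts), ideally recording the largest separation point for each section, since these regions will hold dynamic content.
Compare these:
Total characters: 178. Two unique sections.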
t1 = 'WhereTisthetotalnumberofght5y5wsjhhhhjhkmhm Thethreemethodsthatreturntheratioofmatchingtototalcharacterscangivedifferentresultsduetodifferinglevelsofapxxxxxxxproximation,although'
and
Total characters: 211. Two unique sections.
t2 = 'WhereTisthetotalnumberofdofodfgjnjndfgu><rgregw><sssssuguyguiygis>gggs<GS,Gs Thethreemethodsthatreturntheratioofmatchingtototalcharacterscangivedifferentrexxxxxxxsultsduetodifferinglevelsofapproximation,although'
I know difflib can do this, but its output is not good.
I want to store the character positions in a list, preferably the larger separation values.
Pattern positions
Output:
t1 = 'WhereTisthetotalnumberof 24 ght5y5wsjhhhhjhkmhm 43 Thethreemethodsthatreturntheratioofmatchingtototalcharacterscangivedifferentresultsduetodifferinglevelsofap 151 xxxxxxx 158 proximation,although'
t2 = 'WhereTisthetotalnumberof 24 dofodfgjnjndfgu><rgregw><sssssuguyguiygis>gggs<GS,Gs 76 Thethreemethodsthatreturntheratioofmatchingtototalcharacterscangivedifferentre 155 xxxxxxx 162 sultsduetodifferinglevelsofapproximation,although'
output list = [24, 76, 151, 162]
Update
In reply to @Olivier's post:
The position of every 'y' is set off by ***.
Output for t1 and t2 after matcher.get_matching_blocks() and the join on the last line:
t1 WhereTisthetotalnumberofght5***y***5wsjhhhhjhkmhm Thethreemethodsthatreturntheratioofmatchingtototalcharacterscangivedifferentresultsduetodifferinglevelsofapxxxxxxxproximation,although
t2 WhereTisthetotalnumberofdofodfgjnjndfgu><rgregw><sssssugu***y***gui***y***gis>gggs<GS,Gs Thethreemethodsthatreturntheratioofmatchingtototalcharacterscangivedifferentrexxxxxxxsultsduetodifferinglevelsofapproximation,although
string = ''.join([t1[a:a+n] for a, _, n in blocks])
Answer 0 (score: 1)
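Using difflib is probably your best option, since you are unlikely to come up with a solution more efficient than the algorithms it provides. What you want is SequenceMatcher.get_matching_blocks. Here is what it returns, according to the docs.
Return list of triples describing matching subsequences. Each triple is of the form (i, j, n), and means that a[i:i+n] == b[j:j+n]. The triples are monotonically increasing in i and j.
Here is one way to use it to rebuild a string with the deltas removed.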
from difflib import SequenceMatcher
x = "abc_def"
y = "abc--ef"
matcher = SequenceMatcher(None, x, y)
blocks = matcher.get_matching_blocks()
# blocks: [Match(a=0, b=0, size=3), Match(a=5, b=5, size=2), Match(a=7, b=7, size=0)]
string = ''.join([x[a:a+n] for a, _, n in blocks])
# string: "abcef"
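Since the question also asks for the positions of the differing parts, here is a minimal sketch of how they could be collected with get_opcodes. The names diff_regions and boundaries are mine, not part of difflib. Passing autojunk=False is an assumption on my part: it turns off the heuristic that treats frequent characters as junk once the second string has 200 or more characters, which applies to the 211-character t2. Taking the larger of the two end offsets is only my reading of "larger separation value".

```python
from difflib import SequenceMatcher


def diff_regions(a, b):
    """Return the opcodes for every region where a and b differ.

    Each entry is (tag, a_start, a_end, b_start, b_end), with tag being
    'replace', 'delete' or 'insert'.
    """
    # autojunk=False is an assumption: it disables the "popular element"
    # heuristic that applies once the second sequence has 200+ items.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]


x = "abc_def"
y = "abc--ef"
regions = diff_regions(x, y)
# regions: [('replace', 3, 5, 3, 5)]

# Flatten into boundary positions: the start of each region in x, then
# the larger of the two end offsets (one reading of the question).
boundaries = []
for _, a_start, a_end, _, b_end in regions:
    boundaries += [a_start, max(a_end, b_end)]
# boundaries: [3, 5]
```

The offsets index into the original strings, so the same list can be used to slice out or mask the dynamic parts later.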
Edit: It was also pointed out that if you have two strings like these,
t1 = 'WordWordaayaaWordWord'
t2 = 'WordWordbbbybWordWord'
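then the code above will return 'WordWordyWordWord'. This is because get_matching_blocks also catches the 'y' that is present in both strings between the expected blocks. One way around this is to filter the returned blocks by length.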
string = ''.join([x[a:a+n] for a, _, n in blocks if n > 1])
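To make the effect visible, here is a small runnable check on the two strings from the edit. The variable names unfiltered and filtered are just for illustration.

```python
from difflib import SequenceMatcher

t1 = 'WordWordaayaaWordWord'
t2 = 'WordWordbbbybWordWord'

blocks = SequenceMatcher(None, t1, t2).get_matching_blocks()

# Without a filter, the stray one-character match on 'y' is kept.
unfiltered = ''.join(t1[a:a+n] for a, _, n in blocks)
# unfiltered: 'WordWordyWordWord'

# Dropping blocks of length 1 removes it.
filtered = ''.join(t1[a:a+n] for a, _, n in blocks if n > 1)
# filtered: 'WordWordWordWord'
```

The threshold of 1 is only enough for this example. Longer accidental matches inside the dynamic regions would need a larger cutoff.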
If you want to run a more sophisticated analysis on the returned blocks, you could also do the following.
def block_filter(substring):
    """Outputs True if the substring is to be merged, False otherwise"""
    ...
string = ''.join([x[a:a+n] for a, _, n in blocks if block_filter(x[a:a+n])])
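As one possible way to fill in block_filter, the sketch below keeps a block only if it is at least MIN_BLOCK_LENGTH characters long and contains an alphanumeric character. Both the constant and the rule are hypothetical choices of mine, so adjust them to whatever your data needs.

```python
from difflib import SequenceMatcher

MIN_BLOCK_LENGTH = 3  # hypothetical threshold, tune for your data


def block_filter(substring):
    """Outputs True if the substring is to be merged, False otherwise"""
    # Illustrative rule only: long enough, and not pure punctuation.
    return (len(substring) >= MIN_BLOCK_LENGTH
            and any(ch.isalnum() for ch in substring))


x = 'WordWordaayaaWordWord'
y = 'WordWordbbbybWordWord'
blocks = SequenceMatcher(None, x, y).get_matching_blocks()

string = ''.join([x[a:a+n] for a, _, n in blocks if block_filter(x[a:a+n])])
# string: 'WordWordWordWord'
```

The zero-length sentinel block at the end of the list is rejected by the length check, so it needs no special handling.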