Python 3: differences between two strings

Date: 2018-02-12 00:15:58

Tags: string python-3.x difflib

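I would like to record, in a list, the positions where two strings differ (so that I can remove those parts)... preferably the highest delimiter position of each differing section, since these regions will hold dynamic content.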

Compare these:

Total characters: 178. Two unique sections.

t1 = 'WhereTisthetotalnumberofght5y5wsjhhhhjhkmhm Thethreemethodsthatreturntheratioofmatchingtototalcharacterscangivedifferentresultsduetodifferinglevelsofapxxxxxxxproximation,although'

Total characters: 211. Two unique sections.

t2 = 'WhereTisthetotalnumberofdofodfgjnjndfgu><rgregw><sssssuguyguiygis>gggs<GS,Gs Thethreemethodsthatreturntheratioofmatchingtototalcharacterscangivedifferentrexxxxxxxsultsduetodifferinglevelsofapproximation,although'

I know difflib can do this, but its output is poor.

I would like to store the character positions in a list, preferably the larger delimiter values.

Pattern positions:

t1 = 'WhereTisthetotalnumberof  24  ght5y5wsjhhhhjhkmhm  43  Thethreemethodsthatreturntheratioofmatchingtototalcharacterscangivedifferentresultsduetodifferinglevelsofap  151  xxxxxxx  158  proximation,although'
t2 = 'WhereTisthetotalnumberof  24  dofodfgjnjndfgu><rgregw><sssssuguyguiygis>gggs<GS,Gs  76  Thethreemethodsthatreturntheratioofmatchingtototalcharacterscangivedifferentre  155  xxxxxxx  162  sultsduetodifferinglevelsofapproximation,although'

Output:

output list = [24, 76, 151, 162]

Update

In reply to @Olivier's post:

Positions of all the y characters, set off by ***:

t1: WhereTisthetotalnumberofght5***y***5wsjhhhhjhkmhm Thethreemethodsthatreturntheratioofmatchingtototalcharacterscangivedifferentresultsduetodifferinglevelsofapxxxxxxxproximation,although
t2: WhereTisthetotalnumberofdofodfgjnjndfgu><rgregw><sssssugu***y***gui***y***gis>gggs<GS,Gs Thethreemethodsthatreturntheratioofmatchingtototalcharacterscangivedifferentrexxxxxxxsultsduetodifferinglevelsofapproximation,although

This is after calling matcher.get_matching_blocks() and running:

string = ''.join([t1[a:a+n] for a, _, n in blocks])

1 answer:

Answer 0 (score: 1)

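Using difflib is probably your best option, since you are unlikely to come up with a more efficient solution than the algorithm it already provides. What you want is SequenceMatcher.get_matching_blocks. Here is what it returns, according to the documentation.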

  

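Return list of triples describing matching subsequences. Each triple is of the form (i, j, n), and means that a[i:i+n] == b[j:j+n]. The triples are monotonically increasing in i and j.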
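To make the (i, j, n) triples concrete, here is a small sketch on two made-up strings (the strings are illustrative only and are not from the question). It checks the documented property a[i:i+n] == b[j:j+n] for every block.

```python
from difflib import SequenceMatcher

# Illustrative inputs, not taken from the question.
a = "spam-eggs"
b = "spam+eggs"

blocks = SequenceMatcher(None, a, b).get_matching_blocks()

for i, j, n in blocks:
    # Every triple describes a stretch that is identical in both strings.
    assert a[i:i + n] == b[j:j + n]
    print(i, j, n, repr(a[i:i + n]))

# The last triple is a zero-length sentinel: (len(a), len(b), 0).
```

The trailing sentinel is why code that joins the blocks does not need a special case for the end of the strings: it contributes an empty slice.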

Here is one way to use it to rebuild the string with the differing parts removed.

from difflib import SequenceMatcher

x = "abc_def"
y = "abc--ef"

matcher = SequenceMatcher(None, x, y)
blocks = matcher.get_matching_blocks()

# blocks: [Match(a=0, b=0, size=3), Match(a=5, b=5, size=2), Match(a=7, b=7, size=0)]

string = ''.join([x[a:a+n] for a, _, n in blocks])

# string: "abcef"
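Since the question asks for the positions of the differences rather than the common text, one possible complement is SequenceMatcher.get_opcodes, which reports each region together with its start and end offsets in both strings. The helper name diff_spans below is made up for this sketch; it simply keeps every opcode that is not "equal".

```python
from difflib import SequenceMatcher


def diff_spans(a, b):
    """Return (tag, a_start, a_end, b_start, b_end) for each differing region.

    diff_spans is a hypothetical helper name used only in this sketch.
    """
    matcher = SequenceMatcher(None, a, b)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]


spans = diff_spans("abc_def", "abc--ef")
for tag, a_start, a_end, b_start, b_end in spans:
    print(tag, a_start, a_end, b_start, b_end)
```

From these tuples you can collect whichever boundaries you need, for example the larger of a_end and b_end for each region.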

Edit: It has also been pointed out that this runs into trouble if you have two strings like these.

t1 = 'WordWordaayaaWordWord'
t2 = 'WordWordbbbybWordWord'

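In that case the code above returns 'WordWordyWordWord'. This is because get_matching_blocks also picks up the 'y' that appears in both strings between the expected blocks. One way around this is to filter the returned blocks by length.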

string = ''.join([x[a:a+n] for a, _, n in blocks if n > 1])
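As a self-contained version of the length filter, the sketch below wraps it in a function. The name common_text and the min_size parameter are assumptions made for illustration. It also passes autojunk=False, which is optional here but may matter for longer inputs.

```python
from difflib import SequenceMatcher


def common_text(a, b, min_size=2):
    """Join the parts of `a` that also occur in `b`, skipping short blocks.

    common_text and min_size are hypothetical names for this sketch.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    blocks = matcher.get_matching_blocks()
    return "".join(a[i:i + n] for i, _, n in blocks if n >= min_size)


t1 = 'WordWordaayaaWordWord'
t2 = 'WordWordbbbybWordWord'

print(common_text(t1, t2, min_size=1))  # no filtering, the stray 'y' is kept
print(common_text(t1, t2))              # one-character blocks are dropped
```

On autojunk: by default SequenceMatcher applies a heuristic once the second sequence is at least 200 items long, treating very frequent items as junk. The 211-character string in the question is long enough for that to apply, so it may be worth comparing results with the heuristic switched off.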

If you want a more sophisticated analysis of the returned blocks, you can also do the following.

def block_filter(substring):
    """Outputs True if the substring is to be merged, False otherwise"""
    ...


string = ''.join([x[a:a+n] for a, _, n in blocks if block_filter(x[a:a+n])])
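The body of block_filter is left open above. Purely as an illustration, one possible rule is to keep a block only if it is reasonably long and is not a run of a single repeated character. Both the rule and the min_len parameter are assumptions of this sketch, not part of the answer.

```python
from difflib import SequenceMatcher


def block_filter(substring, min_len=3):
    """Hypothetical rule: keep blocks that are long enough and not a
    run of one repeated character (for example '----')."""
    return len(substring) >= min_len and len(set(substring)) > 1


def merge_blocks(a, b):
    """Join the blocks of `a` that match `b` and pass block_filter."""
    blocks = SequenceMatcher(None, a, b, autojunk=False).get_matching_blocks()
    return "".join(a[i:i + n] for i, _, n in blocks if block_filter(a[i:i + n]))


print(merge_blocks('WordWordaayaaWordWord', 'WordWordbbbybWordWord'))
print(merge_blocks('head1----2tail', 'head3----4tail'))
```

A predicate like this keeps the filtering decision in one place, so the rule can be changed without touching the code that joins the blocks.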