I created a simple script to score the similarity between two strings. Please find the U-SQL and back-end .NET code below.
CN_Matcher.usql:
REFERENCE ASSEMBLY master.FuzzyString;
@searchlog =
EXTRACT ID int,
Input_CN string,
Output_CN string
FROM "/CN_Matcher/Input/sample.txt"
USING Extractors.Tsv();
@CleansCheck =
SELECT ID,Input_CN, Output_CN, CN_Validator.trial.cleanser(Input_CN) AS Input_CN_Cleansed,
CN_Validator.trial.cleanser(Output_CN) AS Output_CN_Cleansed
FROM @searchlog;
@CheckData= SELECT ID,Input_CN, Output_CN, Input_CN_Cleansed, Output_CN_Cleansed,
CN_Validator.trial.Hamming(Input_CN_Cleansed, Output_CN_Cleansed) AS HammingScore,
CN_Validator.trial.LevinstienDistance(Input_CN_Cleansed, Output_CN_Cleansed) AS LevinstienDistance,
FuzzyString.ComparisonMetrics.JaroWinklerDistance(Input_CN_Cleansed, Output_CN_Cleansed) AS JaroWinklerDistance
FROM @CleansCheck;
OUTPUT @CheckData
TO "/CN_Matcher/CN_Full_Run.txt"
USING Outputters.Tsv();
CN_Matcher.usql.cs:
using Microsoft.Analytics.Interfaces;
using Microsoft.Analytics.Types.Sql;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
namespace CN_Validator
{
    public static class trial
    {
        public static string cleanser(string val)
        {
            List<string> wordsToRemove = "l.p. registered pc bldg pllc lp. l.c. div. national l p l.l.c international r. limited school azioni joint co-op corporation corp., (corp) inc., societa company llp liability l.l.l.p llc bancorporation manufacturing c dst (inc) jv ltd. llc. technology ltd., s.a. mfg rllp incorporated per venture l.l.p c. p.l.l.c l.p.. p. partnership corp co-operative s.p.a tech schl bancorp association lllp n r ltd inc. l.l.p. p.c. co district int intl assn. sa inc l.p co, co. division lc intl. lp professional corp. a l. l.l.c. building r.l.l.p co.,".Split(' ').ToList();
            return string.Join(" ", val.ToLower().Split(' ').Except(wordsToRemove));
        }

        public static int Hamming(string source, string target)
        {
            int distance = 0;
            if (source.Length == target.Length)
            {
                for (int i = 0; i < source.Length; i++)
                {
                    if (!source[i].Equals(target[i]))
                    {
                        distance++;
                    }
                }
                return distance;
            }
            else { return 99999; }
        }

        public static int LevinstienDistance(string source, string target)
        {
            int n = source.Length;
            int m = target.Length;
            int[,] d = new int[n + 1, m + 1]; // matrix
            int cost; // cost
            // Step 1
            if (n == 0) return m;
            if (m == 0) return n;
            for (int i = 0; i <= n; d[i, 0] = i++) ;
            for (int j = 0; j <= m; d[0, j] = j++) ;
            for (int i = 1; i <= n; i++)
            {
                for (int j = 1; j <= m; j++)
                {
                    cost = (target.Substring(j - 1, 1) == source.Substring(i - 1, 1) ? 0 : 1);
                    d[i, j] = System.Math.Min(System.Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                        d[i - 1, j - 1] + cost);
                }
            }
            return d[n, m];
        }
    }
}
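I ran a sample batch of 100 inputs with parallelism set to 1 and priority set to 1000. The job completed in 1.6 minutes.
I then wanted to test the same job with 1000 inputs, again with parallelism set to 1 and priority set to 1000. Since 100 inputs took 1.6 minutes, I expected 1000 inputs to take about 20 minutes, but the job ran for more than 50 minutes and I did not see any progress.
So I submitted another 100-input job, and it ran the same as the previous time. I then increased the parallelism to 3 and ran the job again; it still had not completed after 1 hour.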
JOB_ID = 07c0850d-0770-4430-a288-5cddcfc26699
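The main problem is that I cannot see any progress or status.
Please let me know if I am doing something wrong.
Is there any way to use a constructor in U-SQL? If I could do that, I would not need to repeat the same cleansing steps again and again.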
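Answer 0: (score: 2)
I assume you are using the file set syntax to specify the 1000 files? Unfortunately, the current default implementation of file sets does not scale well, and the compilation (preparation) phase will take a long time (as will the execution). We currently have a better implementation in preview. Could you send me an email at usql at Microsoft dot com, and I will tell you how to try out the preview implementation.
Thanks, Michael
Answer 1: (score: 0)
I looked at a more set-based approach. For example, instead of keeping the words to remove in the code-behind file, keep them in a U-SQL table, which is easy to add to: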
CREATE TABLE IF NOT EXISTS dbo.wordsToRemove
(
word string,
INDEX cdx_wordsToRemvoe CLUSTERED (word ASC)
DISTRIBUTED BY HASH (word)
);
INSERT INTO dbo.wordsToRemove ( word )
SELECT word
FROM (
VALUES
( "l.p." ),
( "registered" ),
( "pc" ),
( "bldg" ),
( "pllc" ),
( "lp." ),
( "l.c." ),
( "div." ),
( "national" ),
( "l" ),
( "p" ),
( "l.l.c" ),
( "international" ),
( "r." ),
( "limited" ),
( "school" ),
( "azioni" ),
( "joint" ),
( "co-op" ),
( "corporation" ),
( "corp.," ),
( "(corp)" ),
( "inc.," ),
( "societa" ),
( "company" ),
( "llp" ),
( "liability" ),
( "l.l.l.p" ),
( "llc" ),
( "bancorporation" ),
( "manufacturing" ),
( "c" ),
( "dst" ),
( "(inc)" ),
( "jv" ),
( "ltd." ),
( "llc." ),
( "technology" ),
( "ltd.," ),
( "s.a." ),
( "mfg" ),
( "rllp" ),
( "incorporated" ),
( "per" ),
( "venture" ),
( "l.l.p" ),
( "c." ),
( "p.l.l.c" ),
( "l.p.." ),
( "p." ),
( "partnership" ),
( "corp" ),
( "co-operative" ),
( "s.p.a" ),
( "tech" ),
( "schl" ),
( "bancorp" ),
( "association" ),
( "lllp" ),
( "n" ),
( "r" ),
( "ltd" ),
( "inc." ),
( "l.l.p." ),
( "p.c." ),
( "co" ),
( "district" ),
( "int" ),
( "intl" ),
( "assn." ),
( "sa" ),
( "inc" ),
( "l.p" ),
( "co," ),
( "co." ),
( "division" ),
( "lc" ),
( "intl." ),
( "lp" ),
( "professional" ),
( "corp." ),
( "a" ),
( "l." ),
( "l.l.c." ),
( "building" ),
( "r.l.l.p" ),
( "co.," )
) AS words(word);
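Then, for the comparison, I split the original phrases apart, removed the words we do not want, and put the phrases back together again, as follows: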
//DECLARE @inputFile string = "input/input.csv"; // 500 companies, Standard & Poor 500 companies from wikipedia
DECLARE @inputFile string = "input/input2.csv"; // 850,000 companies, part 1 of extract from Companies House
@searchlog =
EXTRACT id int,
Input_CN string,
Output_CN string
FROM @inputFile
USING Extractors.Csv(silent : true);
//USING Extractors.Csv(skipFirstNRows:1);
// Split the input string to remove unwanted words
@Input_CN =
SELECT id,
new SQL.ARRAY<string>(Input_CN.Split(' ')) AS splitWords
FROM @searchlog;
@Output_CN =
SELECT id,
new SQL.ARRAY<string>(Output_CN.Split(' ')) AS splitWords
FROM @searchlog;
// Remove unwanted words from input string
@Input_CN =
SELECT *
FROM
(
SELECT o.id,
x.splitWord.ToLower() AS splitWord
FROM @Input_CN AS o
CROSS APPLY
EXPLODE(splitWords) AS x(splitWord)
) AS y
ANTISEMIJOIN
dbo.wordsToRemove AS w
ON y.splitWord == w.word;
// Remove unwanted words from output string
@Output_CN =
SELECT *
FROM
(
SELECT o.id,
x.splitWord.ToLower() AS splitWord
FROM @Output_CN AS o
CROSS APPLY
EXPLODE(splitWords) AS x(splitWord)
) AS y
ANTISEMIJOIN
dbo.wordsToRemove AS w
ON y.splitWord == w.word;
// Put the input string back together again
@Input_CN =
SELECT id,
String.Join( " ", ARRAY_AGG (splitWord) ) AS Input_CN_Cleansed
FROM @Input_CN
GROUP BY id;
@Output_CN =
SELECT id,
String.Join( " ", ARRAY_AGG (splitWord) ) AS Output_CN_Cleansed
FROM @Output_CN
GROUP BY id;
@output =
SELECT i.id,
i.Input_CN_Cleansed,
o.Output_CN_Cleansed,
CN_Validator.trial.Hamming(i.Input_CN_Cleansed, o.Output_CN_Cleansed) AS HammingScore,
CN_Validator.trial.LevinstienDistance(i.Input_CN_Cleansed, o.Output_CN_Cleansed) AS LevinstienDistance
FROM @Input_CN AS i
INNER JOIN
@Output_CN AS o
ON i.id == o.id;
OUTPUT @output
TO "/output/output.csv"
USING Outputters.Csv();
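I found the performance similar, but the design may be more maintainable. My code ran over 850k+ records in only a few minutes rather than 50+ minutes, so there may be another issue. NB: I did not have the FuzzyString library, so it was not included in my test; that could explain the difference.
If you get an update on this from Microsoft, please post back to this thread, and even mark it as the answer if you like.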
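On the constructor question: if you prefer to keep the cleansing in the code-behind, a static readonly field is initialized once when the type is first used, so the word list does not have to be split and rebuilt for every row. The following is only a minimal sketch under assumptions: the class name TrialSketch and the shortened word list are hypothetical, and I have not run it inside a U-SQL job. Note two behavioral differences from the original cleanser: Where keeps repeated words (Except returns distinct values), and ToLowerInvariant ignores the current culture.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

namespace CN_Validator
{
    // Hypothetical name; not part of the original code-behind.
    public static class TrialSketch
    {
        // Built once when the type is first used, not once per call.
        // Shortened list for illustration; the full list from the question would go here.
        private static readonly HashSet<string> WordsToRemove =
            new HashSet<string>(
                "l.p. registered pc bldg pllc inc. ltd llc corp co.".Split(' '),
                StringComparer.Ordinal);

        public static string Cleanser(string val)
        {
            // Treat null or empty input as empty rather than throwing.
            if (string.IsNullOrEmpty(val)) return string.Empty;

            // HashSet lookup per word; unlike Except, repeated words are kept.
            return string.Join(" ",
                val.ToLowerInvariant()
                   .Split(' ')
                   .Where(w => !WordsToRemove.Contains(w)));
        }
    }
}
```

With this sketch, CN_Validator.TrialSketch.Cleanser("Acme Widgets Inc.") returns "acme widgets". Whether this makes a measurable difference to the job's runtime is something you would have to measure.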