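I have 2 text files that contain words separated by newlines. Each file is sorted in ascending order and is about 60 MB. I need to get all the words from the first file that do not exist in the second file (a kind of except operation). The two files do not necessarily contain the same number of words.
I wanted to rely on the fact that both files are sorted, but did not really succeed. I am using TPL to do the work in parallel. I started with something, but I do not know how to finish it or how to make it run in parallel.
I would appreciate any help.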
static StreamReader _streamReader1 = new StreamReader("file1.txt");
static StreamReader _streamReader2 = new StreamReader("file2.txt");
static IEnumerable<string> GetWordsFromFile1()
{
while (!_streamReader1.EndOfStream)
{
yield return _streamReader1.ReadLine();
}
}
static List<string> exceptedWords = new List<string>();
static void ExceptWords(string word)
{
//Here I believe I should read a word from 2nd file and somehow to compare to <word>
// and continue reading until word < word2?
}
static void Main(string[] args)
{
var words = GetWordsFromFile1();
Parallel.ForEach(words, ExceptWords);
}
Answer 0 (score: 3)
IMHO, KISS wins for this:
var wordsFromFile1 = File.ReadAllLines("file1.txt");
var wordsFromFile2 = File.ReadAllLines("file2.txt");
var file1ExceptFile2 = wordsFromFile1.Except(wordsFromFile2);
If you want a case-insensitive comparison:
var wordsFromFile1 = File.ReadAllLines("file1.txt");
var wordsFromFile2 = File.ReadAllLines("file2.txt");
var file1ExceptFile2 = wordsFromFile1.Except(wordsFromFile2, StringComparer.OrdinalIgnoreCase);
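Answer 1 (score: 2)
Maybe this does not directly answer your question, but I do not see a simple way to use TPL or to rely on the fact that the files are sorted. I believe LINQ's Except method does the heavy lifting. Since the files are not astronomically large, loading them into memory should not be a problem.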
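One detail worth knowing before relying on Except: it is a set operation, so each distinct word is returned only once, in the order it first appears in the first sequence. The sketch below uses small in-memory arrays in place of the two files; the class name ExceptDemo, the helper Difference, and the sample words are made up for illustration.

```csharp
using System;
using System.Linq;

static class ExceptDemo
{
    // Returns the distinct words of 'first' that do not occur in 'second'.
    public static string[] Difference(string[] first, string[] second)
    {
        return first.Except(second).ToArray();
    }

    static void Main()
    {
        // Hypothetical stand-ins for the contents of the two files.
        var file1 = new[] { "abc", "bcd", "bcd", "xyz", "zzz" };
        var file2 = new[] { "abc", "xyz" };

        // "bcd" appears twice in file1 but only once in the result.
        Console.WriteLine(string.Join(",", Difference(file1, file2))); // prints "bcd,zzz"
    }
}
```

If duplicates in the first file must be preserved, Except is not a drop-in fit and a merge over the sorted inputs is closer to what is needed.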
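Answer 2 (score: 2)
I would not use something like this until I had measured the simple case and determined that it was not "fast enough", but here is a brain-dead (and non-parallel) approach that takes advantage of the sorted inputs. There are other and better ways to write this, but the idea is that you start two "streams" and then just move them forward while comparing.
Ignoring edge cases and the start/end, you compare the current word of each of the two word streams. Either the "input" word is earlier (keep it), the words match (skip it), or the input word is later (move the "except" stream forward).
You could keep things in locals, such as the current word from each "stream", but IMHO you are better off ignoring this approach and using LINQ Except or SortedSet.ExceptWith, at least until you have actual profiling measurements showing that you need something more complex. :)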
void Main()
{
var input = new[] { "abc", "bcd", "xyz", "zzz", };
var except = new[] { "abc", "xyz", };
ExceptSortedInputs(input, except).Dump();
}
// Define other methods and classes here
public static IEnumerable<string> ExceptSortedInputs(IEnumerable<string> inputSequence, IEnumerable<string> exceptSequence)
{
Contract.Requires<ArgumentNullException>(inputSequence != null);
Contract.Requires<ArgumentNullException>(exceptSequence != null);
var exceptEnumerator = exceptSequence.GetEnumerator();
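// advance explicitly: Contract.Assert can be compiled out, which would also drop the MoveNext() call
if (exceptEnumerator.MoveNext() == false)
{
// empty except sequence, so every input word is a result
foreach (var word in inputSequence) yield return word;
yield break;
}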
var inputEnumerator = inputSequence.GetEnumerator();
while (inputEnumerator.MoveNext())
{
// need to move the except sequence forward to ensure it's at or later than the current input word
// String.Compare only guarantees the sign of its result, so test for > 0 rather than == 1
while (String.Compare(inputEnumerator.Current, exceptEnumerator.Current) > 0)
{
if (exceptEnumerator.MoveNext() == false)
{
// stupid optimization - since there are no more except matches, we can just return the rest of the input
do
{
yield return inputEnumerator.Current;
}
while (inputEnumerator.MoveNext());
yield break;
}
}
// when we get here, we know the current 'except' word is equal to or later than the input one, so we can just check equality
if (inputEnumerator.Current != exceptEnumerator.Current)
{
yield return inputEnumerator.Current;
}
}
}
Here is a version that looks more like the interleaved nature of a typical merge join (and adds locals that may help clarity):
void Main()
{
var input = new[] { "abc", "bcd", "xyz", "zzz", };
var except = new[] { "abc", "xyz", };
ExceptSortedInputs(input, except).Dump();
}
// Define other methods and classes here
public static IEnumerable<string> ExceptSortedInputs(IEnumerable<string> inputSequence, IEnumerable<string> exceptSequence)
{
var exceptEnumerator = exceptSequence.GetEnumerator();
var exceptStillHasElements = exceptEnumerator.MoveNext();
var inputEnumerator = inputSequence.GetEnumerator();
var inputStillHasElements = inputEnumerator.MoveNext();
while (inputStillHasElements)
{
if (exceptStillHasElements == false)
{
// since we exhausted the except sequence, we know we can safely return any input elements
yield return inputEnumerator.Current;
inputStillHasElements = inputEnumerator.MoveNext();
continue;
}
// need to compare to see which operation to perform
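// Math.Sign normalizes the result to -1, 0 or 1, since String.Compare only guarantees the sign
switch (Math.Sign(String.Compare(inputEnumerator.Current, exceptEnumerator.Current)))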
{
case -1:
// except sequence is already later, so we can safely return this
yield return inputEnumerator.Current;
inputStillHasElements = inputEnumerator.MoveNext();
break;
case 0:
// except sequence has a match, so we can safely skip this
inputStillHasElements = inputEnumerator.MoveNext();
break;
case 1:
// except sequence is behind - we need to move it forward
exceptStillHasElements = exceptEnumerator.MoveNext();
break;
}
}
}
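For comparison, the SortedSet.ExceptWith alternative mentioned in this answer can be sketched as follows. This is only an illustration under assumptions: the class name SortedSetDemo and the helper Difference are hypothetical, and ordinal ordering is assumed because the thread does not say how the files were sorted. Like LINQ Except, it drops duplicate words.

```csharp
using System;
using System.Collections.Generic;

static class SortedSetDemo
{
    // Loads 'input' into a sorted set, then removes every word found in 'except'.
    public static SortedSet<string> Difference(IEnumerable<string> input, IEnumerable<string> except)
    {
        // Ordinal comparison is an assumption about how the files were sorted.
        var result = new SortedSet<string>(input, StringComparer.Ordinal);
        result.ExceptWith(except);
        return result;
    }

    static void Main()
    {
        var input = new[] { "abc", "bcd", "xyz", "zzz" };
        var except = new[] { "abc", "xyz" };
        Console.WriteLine(string.Join(",", Difference(input, except))); // prints "bcd,zzz"
    }
}
```

The result stays sorted, but the whole first file is held in memory as a tree, so this trades the streaming behaviour of the merge for simplicity.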
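Answer 3 (score: 1)
What you are looking for is a merge join. You can use this algorithm in slightly different forms to compute any of the following:
And of course others. I guess you will find a lot of information when searching for that particular name.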
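To illustrate what "a slightly different form" of the same merge can look like, here is a minimal sketch that computes the intersection of two sorted sequences instead of the difference. Intersection is chosen here only as an example (the answer's own list of operations did not survive), and the names MergeJoinDemo and IntersectSorted are hypothetical. It assumes both inputs are sorted ascending by ordinal comparison.

```csharp
using System;
using System.Collections.Generic;

static class MergeJoinDemo
{
    // Yields the words present in both sequences.
    // Assumes both inputs are sorted ascending by ordinal comparison.
    public static IEnumerable<string> IntersectSorted(IEnumerable<string> left, IEnumerable<string> right)
    {
        using (var l = left.GetEnumerator())
        using (var r = right.GetEnumerator())
        {
            bool hasLeft = l.MoveNext();
            bool hasRight = r.MoveNext();
            while (hasLeft && hasRight)
            {
                int cmp = string.CompareOrdinal(l.Current, r.Current);
                if (cmp < 0)
                {
                    hasLeft = l.MoveNext();   // left word is earlier: no match possible, advance left
                }
                else if (cmp > 0)
                {
                    hasRight = r.MoveNext();  // right word is earlier: advance right
                }
                else
                {
                    yield return l.Current;   // match: emit once and advance both sides
                    hasLeft = l.MoveNext();
                    hasRight = r.MoveNext();
                }
            }
        }
    }

    static void Main()
    {
        var left = new[] { "abc", "bcd", "xyz", "zzz" };
        var right = new[] { "abc", "xyz", "yyy" };
        Console.WriteLine(string.Join(",", IntersectSorted(left, right))); // prints "abc,xyz"
    }
}
```

The loop has the same skeleton as an except: only the action taken in each of the three comparison outcomes changes.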
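Answer 4 (score: 0)
I saw the posted answers and thought, "I wonder how the different approaches compare?"
So I downloaded 2 dictionary files, wrote some timing code, and pasted the posted code into VS2010.
The output was: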
> ManningsBaseCase1: ElapsedTime: 0.1973, numOfIterations: 64
> ManningsBaseCase2: ElapsedTime: 0.2036, numOfIterations: 64
> KevinsLINQ1: ElapsedTime: 0.1803, numOfIterations: 64
> KevinsLINQ2: ElapsedTime: 0.1773, numOfIterations: 64
> ManningsOldMerge: ElapsedTime: 0.0797, numOfIterations: 128
> ManningsCleanMerge: ElapsedTime: 0.0800, numOfIterations: 256
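Each person's code is run for enough iterations to exceed 10 seconds in total, and then the average time per iteration is taken.
The results may be slightly skewed, but I did not feel like timing an empty for loop of 128 iterations to subtract the loop overhead (left as an exercise for the reader).
The code also validates that every approach produces the same result.
Here is the code: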
class Program
{
private static readonly string filename1 = "DictoFile1.txt";
private static readonly string filename2 = "DictoFile2.txt";
private static readonly int numOfTests = 6;
private static readonly int MinTimingVal = 1000;
private static string[] testNames = new string[] {
"ManningsBaseCase1: ",
"ManningsBaseCase2: ",
"KevinsLINQ1: ",
"KevinsLINQ2: ",
"ManningsOldMerge: ",
"ManningsCleanMerge: "
};
private static string[] prev;
private static string[] next;
public static void Main(string[] args)
{
Console.WriteLine("Starting tests...");
Debug.WriteLine("Starting tests...");
Console.WriteLine("");
Debug.WriteLine("");
Action[] actionArray = new Action[numOfTests];
actionArray[0] = ManningsBaseCase1;
actionArray[1] = ManningsBaseCase2;
actionArray[2] = KevinsLINQ1;
actionArray[3] = KevinsLINQ2;
actionArray[4] = ManningsOldInterleaved;
actionArray[5] = ManningsCleanInterleaved;
for( int i = 0; i < actionArray.Length; i++ )
{
Console.Write(testNames[i]);
Debug.Write(testNames[i]);
Action a = actionArray[i];
DoTiming(a, i);
if (i > 0)
{
if (!ValidateLists())
{
Console.WriteLine(" --- Validation had an error.");
Debug.WriteLine(" --- Validation had an error.");
}
}
prev = next;
}
Console.WriteLine("");
Debug.WriteLine("");
Console.WriteLine("Tests complete.");
Debug.WriteLine("Tests complete.");
Console.WriteLine("Press Enter to Close Console...");
Debug.WriteLine("Press Enter to Close Console...");
Console.ReadLine();
}
private static bool ValidateLists()
{
if (prev == null) return false;
if (next == null) return false;
if (prev.Length != next.Length) return false;
for (int i = 0; i < prev.Length; i++)
{
if (prev[i] != next[i]) return false;
}
return true;
}
private static void DoTiming( Action a, int num )
{
a.Invoke();
Stopwatch watch = new Stopwatch();
Stopwatch loopWatch = new Stopwatch();
bool shouldRetry = false;
int numOfIterations = 2;
do
{
watch.Start();
for (int i = 0; i < numOfIterations; i++)
{
a.Invoke();
}
watch.Stop();
shouldRetry = false;
if (watch.ElapsedMilliseconds < MinTimingVal) //if the time was less than the minimum, increase load and re-time.
{
shouldRetry = true;
numOfIterations *= 2;
watch.Reset();
}
} while ( shouldRetry );
long totalTime = watch.ElapsedMilliseconds;
double avgTime = ((double)totalTime) / (double)numOfIterations;
Console.WriteLine("ElapsedTime: {0:N4}, numOfIterations: " + numOfIterations, avgTime/1000.00);
Debug.WriteLine("ElapsedTime: {0:N4}, numOfIterations: " + numOfIterations, avgTime / 1000.00);
}
private static void ManningsBaseCase1()
{
string[] wordsFromFile1 = File.ReadAllLines( filename1 );
string[] wordsFromFile2 = File.ReadAllLines( filename2 );
IEnumerable<string> file1ExceptFile2 = wordsFromFile1.Except(wordsFromFile2);
string[] asArray = file1ExceptFile2.ToArray();
next = asArray;
}
private static void ManningsBaseCase2()
{
string[] wordsFromFile1 = File.ReadAllLines( filename1 );
string[] wordsFromFile2 = File.ReadAllLines( filename2 );
IEnumerable<string> file1ExceptFile2 = wordsFromFile1.Except(wordsFromFile2, StringComparer.OrdinalIgnoreCase);
string[] asArray = file1ExceptFile2.ToArray();
next = asArray;
}
private static IEnumerable<string> GetWordsFromFile(StreamReader _streamReader)
{
while (!_streamReader.EndOfStream)
{
yield return _streamReader.ReadLine();
}
}
private static void KevinsLINQ1()
{
using (StreamReader _streamReader1 = new StreamReader(filename1))
{
using (StreamReader _streamReader2 = new StreamReader(filename2))
{
IEnumerable<string> words = GetWordsFromFile(_streamReader1)
.Except(GetWordsFromFile(_streamReader2));
string[] asArray = words.ToArray();
next = asArray;
}
}
}
private static void KevinsLINQ2()
{
using (StreamReader _streamReader1 = new StreamReader(filename1))
{
using (StreamReader _streamReader2 = new StreamReader(filename2))
{
IEnumerable<string> words = GetWordsFromFile(_streamReader1)
.Except(GetWordsFromFile(_streamReader2).AsParallel());
string[] asArray = words.ToArray();
next = asArray;
}
}
}
// Define other methods and classes here
public static IEnumerable<string> ExceptSortedInputsOld(IEnumerable<string> inputSequence, IEnumerable<string> exceptSequence)
{
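IEnumerator<string> exceptEnumerator = exceptSequence.GetEnumerator();
// position the except enumerator on its first word before Current is read
if (exceptEnumerator.MoveNext() == false)
{
// empty except sequence, so every input word is a result
foreach (string word in inputSequence) yield return word;
yield break;
}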
IEnumerator<string> inputEnumerator = inputSequence.GetEnumerator();
while (inputEnumerator.MoveNext())
{
// need to move the except sequence forward to ensure it's at or later than the current input word
// String.Compare only guarantees the sign of its result, so test for > 0 rather than == 1
while (String.Compare(inputEnumerator.Current, exceptEnumerator.Current) > 0)
{
if (exceptEnumerator.MoveNext() == false)
{
// stupid optimization - since there are no more except matches, we can just return the rest of the input
do
{
yield return inputEnumerator.Current;
}
while (inputEnumerator.MoveNext());
yield break;
}
}
// when we get here, we know the current 'except' word is equal to or later than the input one, so we can just check equality
if (inputEnumerator.Current != exceptEnumerator.Current)
{
yield return inputEnumerator.Current;
}
}
}
private static void ManningsOldInterleaved()
{
IEnumerable<string> wordsFromFile1 = File.ReadLines(filename1);
IEnumerable<string> wordsFromFile2 = File.ReadLines(filename2);
IEnumerable<string> file1ExceptFile2 = ExceptSortedInputsOld(wordsFromFile1, wordsFromFile2);
string[] asArray = file1ExceptFile2.ToArray();
next = asArray;
}
private static IEnumerable<string> ExceptSortedInputsClean(IEnumerable<string> inputSequence, IEnumerable<string> exceptSequence)
{
IEnumerator<string> exceptEnumerator = exceptSequence.GetEnumerator();
bool exceptStillHasElements = exceptEnumerator.MoveNext();
IEnumerator<string> inputEnumerator = inputSequence.GetEnumerator();
bool inputStillHasElements = inputEnumerator.MoveNext();
while (inputStillHasElements)
{
if (exceptStillHasElements == false)
{
// since we exhausted the except sequence, we know we can safely return any input elements
yield return inputEnumerator.Current;
inputStillHasElements = inputEnumerator.MoveNext();
continue;
}
// need to compare to see which operation to perform
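// Math.Sign normalizes the result to -1, 0 or 1, since String.Compare only guarantees the sign
switch (Math.Sign(String.Compare(inputEnumerator.Current, exceptEnumerator.Current)))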
{
case -1:
// except sequence is already later, so we can safely return this
yield return inputEnumerator.Current;
inputStillHasElements = inputEnumerator.MoveNext();
break;
case 0:
// except sequence has a match, so we can safely skip this
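// keep the flag in sync, otherwise the loop reads past the end when the last input word matches
inputStillHasElements = inputEnumerator.MoveNext();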
break;
case 1:
// except sequence is behind - we need to move it forward
exceptStillHasElements = exceptEnumerator.MoveNext();
break;
}
}
}
private static void ManningsCleanInterleaved()
{
IEnumerable<string> wordsFromFile1 = File.ReadLines(filename1);
IEnumerable<string> wordsFromFile2 = File.ReadLines(filename2);
IEnumerable<string> file1ExceptFile2 = ExceptSortedInputsClean(wordsFromFile1, wordsFromFile2);
string[] asArray = file1ExceptFile2.ToArray();
next = asArray;
}
}
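Just copy and paste it into VS2010 (.NET 4.0), add the txt files and the using directives, and it should work.
Note: I changed MinTimingVal to 1 second instead of 10 seconds.
So, anyway, Manning's merge approach beats the others hands down.
Good work, folks.
All that said, I still think the file input could be parallelized by using the FileStream class. Create two different FileStreams on the same file, have one start at the beginning, and have the other Seek() or set its .Position to the middle of the file and read from there.
If I get around to it, I may try that and see whether parallelizing the I/O really speeds things up.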
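A rough sketch of that two-FileStream idea follows. Nothing here was measured, so whether it is faster than File.ReadAllLines is an open question. The names ParallelReadDemo, FindSplit, ReadRange and ReadAllLinesInTwoHalves are hypothetical, and the sketch assumes UTF-8 text without a byte order mark, '\n' or "\r\n" line endings, and no meaningful empty lines (they are dropped). The split point is moved forward to just past the next newline so that no word is cut in half.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

static class ParallelReadDemo
{
    // Scans forward from the middle of the file to the byte just past the next '\n'.
    static long FindSplit(string path)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
        {
            fs.Position = fs.Length / 2;
            int b;
            while ((b = fs.ReadByte()) != -1 && b != '\n') { }
            return fs.Position; // a line boundary, or the end of the file
        }
    }

    // Reads the bytes [start, end) through its own FileStream and splits them into lines.
    static string[] ReadRange(string path, long start, long end)
    {
        var buffer = new byte[end - start];
        int read = 0;
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
        {
            fs.Position = start;
            while (read < buffer.Length)
            {
                int n = fs.Read(buffer, read, buffer.Length - read);
                if (n == 0) break; // unexpected end of file
                read += n;
            }
        }
        return Encoding.UTF8.GetString(buffer, 0, read)
            .Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(line => line.TrimEnd('\r'))
            .ToArray();
    }

    // Reads both halves of the file on separate tasks and joins them in file order.
    public static string[] ReadAllLinesInTwoHalves(string path)
    {
        long split = FindSplit(path);
        long length = new FileInfo(path).Length;
        Task<string[]> firstHalf = Task.Factory.StartNew(() => ReadRange(path, 0, split));
        Task<string[]> secondHalf = Task.Factory.StartNew(() => ReadRange(path, split, length));
        return firstHalf.Result.Concat(secondHalf.Result).ToArray();
    }

    static void Main()
    {
        // Small temporary file standing in for one of the 60 MB word lists.
        string path = Path.GetTempFileName();
        File.WriteAllText(path, "abc\nbcd\nxyz\nzzz\n");
        Console.WriteLine(string.Join(",", ReadAllLinesInTwoHalves(path))); // prints "abc,bcd,xyz,zzz"
        File.Delete(path);
    }
}
```

Because both halves come from the same sorted file, concatenating them in order keeps the result sorted, so it can be fed straight into the merge-based except.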