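I have a text file from which I read lines of text. From all of that text I also need to find the longest sentence and the line on which it starts. Finding the longest sentence is no problem, but I run into trouble when I need to find where it starts.
The contents of the text file are: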
V. M. Putinas Margi sakalai
Lydėdami gęstančią žarą vėlai
Pakilo į dangų;;, margi sakalai. Paniekinę žemės vylingus sapnus,
Padangėje ištiesė,,, savo sparnus.
Ir tarė margieji: negrįšim į žemę,
Kol josios kalnai ir pakalnės aptemę.
My code:
static void Sakiniai (string fv, string skyrikliai)
{
    char[] skyrikliaiSak = { '.', '!', '?' };
    string naujas = "";
    string[] lines = File.ReadAllLines(fv, Encoding.GetEncoding(1257));
    foreach (string line in lines)
    {
        // Add lines into a string so I can separate them into sentences
        naujas += line;
    }
    // Separating into sentences
    string[] sakiniai = naujas.Split(skyrikliaiSak);
    // This method finds the longest sentence
    string ilgiausiasSak = RastiIlgiausiaSakini(sakiniai);
}
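From the text file, the longest sentence is: "Margi sakalai Lydėdami gęstančią žarą vėlai Pakilo į dangų;;, margi sakalai"
How do I find the exact line on which the sentence starts?
Answer 0 (score: 1)
How about a nested for loop? If two sentences have the same length, it simply finds the first one.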
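var lines = File.ReadAllLines(fv, Encoding.GetEncoding(1257));
var terminators = new HashSet<char> { '.', '?', '!' };
var currentSentence = new StringBuilder();
var currentStartLine = default(int?); // zero-based line on which the current sentence began
var maxLine = default(int?);          // zero-based line on which the longest sentence began
var maxSentence = "";
for (var currentLine = 0; currentLine < lines.Length; currentLine++)
{
    foreach (var character in lines[currentLine])
    {
        if (terminators.Contains(character))
        {
            // Strictly greater, so the first of several equally long sentences wins.
            if (currentSentence.Length > maxSentence.Length)
            {
                maxLine = currentStartLine;
                maxSentence = currentSentence.ToString();
            }
            currentSentence.Clear();
            currentStartLine = null;
        }
        else
        {
            // Record the line of the sentence's first character, not the line where it ends.
            if (currentStartLine == null)
            {
                currentStartLine = currentLine;
            }
            currentSentence.Append(character);
        }
    }
}
// Note: trailing text without a terminator is not counted as a sentence.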
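Answer 1 (score: 0)
First find the start index of the longest sentence within the whole content:
int startIdx = naujas.IndexOf(ilgiausiasSak);
Then loop over the lines to find out which line startIdx falls into:
int i = 0;
while (i < lines.Length && startIdx >= 0)
{
    startIdx -= lines[i].Length;
    i++;
}
// do stuff with i
i is the line on which the longest sentence starts. For example, i = 2 means it starts on the second line.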
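The two steps above can be combined into one helper. The following is only a sketch: the class name `SentenceLocator` and the method name `FindStartLine` are made up for illustration, and it assumes the text was built by concatenating the lines with no separator, as in the question's code.

```csharp
using System;

public static class SentenceLocator
{
    // Returns the 1-based number of the line on which `sentence` starts,
    // or -1 if the sentence does not occur in the concatenated text.
    // Assumption: the text is the lines joined with no separator.
    public static int FindStartLine(string[] lines, string sentence)
    {
        string text = string.Concat(lines);

        // Ordinal comparison avoids culture-specific matching surprises.
        int startIdx = text.IndexOf(sentence, StringComparison.Ordinal);
        if (startIdx < 0)
        {
            return -1;
        }

        // Subtract line lengths until the offset falls inside a line.
        int i = 0;
        while (i < lines.Length && startIdx >= 0)
        {
            startIdx -= lines[i].Length;
            i++;
        }
        return i;
    }
}
```

Two caveats: `IndexOf` reports the first occurrence, so if the same text also appears earlier, the earlier line is returned. A sentence produced by `Split` may begin with whitespace, in which case the reported line is the one where that whitespace begins.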
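Answer 2 (score: 0)
How about this?
Parts of a sentence may also appear on other lines that belong to other sentences, so you need to correctly identify sentences that spread across consecutive lines.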
// paragraph is assumed to hold the whole text, with lines separated by '\n'
// define separators for various contexts
var separator = new
{
    Lines = new[] { '\n' },
    Sentences = new[] { '.', '!', '?' },
    Sections = new[] { '\n' },
};
// isolate the lines and their corresponding (1-based) number
var lines = paragraph
    .Split(separator.Lines, StringSplitOptions.RemoveEmptyEntries)
    .Select((text, number) => new
    {
        Number = number + 1,
        Text = text,
    })
    .ToList();
// isolate the sentences with corresponding sections and line numbers
var sentences = paragraph
    .Split(separator.Sentences, StringSplitOptions.RemoveEmptyEntries)
    .Select(sentence => sentence.Trim())
    .Select(sentence => new
    {
        Text = sentence,
        Length = sentence.Length,
        Sections = sentence
            .Split(separator.Sections)
            .Select((section, index) => new
            {
                Index = index,
                Text = section,
                Lines = lines
                    .Where(line => line.Text.Contains(section))
                    .Select(line => line.Number)
            })
            .OrderBy(section => section.Index)
    })
    .OrderByDescending(p => p.Length)
    .ToList();
// build the possible combinations of sections within a sentence
// and filter only those that are on consecutive lines
var results = from sentence in sentences
              let occurences = sentence.Sections
                  .Select(p => p.Lines)
                  .Cartesian()
                  .Where(p => p.Consecutive())
                  .SelectMany(p => p)
              select new
              {
                  Text = sentence.Text,
                  Length = sentence.Length,
                  Lines = occurences,
              };
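The final result gives, for each sentence, its text, its length, and the lines it spans.
Here .Cartesian and .Consecutive are just helper extension methods on enumerables (see the related gist for the entire source code in LINQPad-ready format):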
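// Extension methods must live in a static class; the class name here is arbitrary.
public static class EnumerableExtensions
{
    // Wraps a single value in a one-element sequence.
    public static IEnumerable<T> Yield<T>(this T instance)
    {
        yield return instance;
    }

    // Cartesian product: every way of picking one item from each inner sequence.
    public static IEnumerable<IEnumerable<T>> Cartesian<T>(this IEnumerable<IEnumerable<T>> instance)
    {
        var seed = Enumerable.Empty<T>().Yield();
        return instance.Aggregate(seed, (accumulator, sequence) =>
        {
            var results = from vector in accumulator
                          from item in sequence
                          select vector.Concat(new[]
                          {
                              item
                          });
            return results;
        });
    }

    // True when the distinct values each increase by exactly one.
    public static bool Consecutive(this IEnumerable<int> instance)
    {
        var distinct = instance.Distinct().ToList();
        return distinct
            .Zip(distinct.Skip(1), (a, b) => a + 1 == b)
            .All(p => p);
    }
}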
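Answer 3 (score: 0)
Build an index that solves the problem.
We can make a straightforward modification to your existing code:
var lineOffsets = new List<int>();
lineOffsets.Add(0);
foreach (string line in lines)
{
    // Add lines into a string so I can separate them into sentences
    naujas += line;
    lineOffsets.Add(naujas.Length);
}
OK; you now have a list of character offsets into the final string, one for the start of each line.
You have a substring of the big string. You can use IndexOf to find the offset of that substring in the big string. You can then search the list for the index of the last element that is less than or equal to that offset. That index is the (zero-based) line number.
If the list is large, you can binary search it.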
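A minimal sketch of that lookup, using `List<int>.BinarySearch` from the standard library. The names `LineIndex`, `BuildOffsets`, and `LineOf` are hypothetical and chosen only for this example. The offsets list is built the same way as above: a leading 0 plus the running length after each line.

```csharp
using System;
using System.Collections.Generic;

public static class LineIndex
{
    // Builds the offset list: a leading 0, then the running length after each line.
    public static List<int> BuildOffsets(string[] lines)
    {
        var offsets = new List<int>(lines.Length + 1) { 0 };
        int position = 0;
        foreach (string line in lines)
        {
            position += line.Length;
            offsets.Add(position);
        }
        return offsets;
    }

    // Returns the index of the last element that is <= offset,
    // which is the zero-based line number for an offset inside the text.
    public static int LineOf(List<int> offsets, int offset)
    {
        int index = offsets.BinarySearch(offset);
        if (index < 0)
        {
            // Not found: ~index is the position of the first larger element.
            return ~index - 1;
        }
        // Found: empty lines create duplicate offsets and BinarySearch may
        // return any of them, so move forward to the last equal element.
        while (index + 1 < offsets.Count && offsets[index + 1] == offset)
        {
            index++;
        }
        return index;
    }
}
```

With this in place, the start line of the longest sentence would be `LineIndex.LineOf(lineOffsets, naujas.IndexOf(ilgiausiasSak))`, assuming `IndexOf` found the sentence (a result of -1 from `IndexOf` yields -1 here as well).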