How do I get the sentence containing a specific word from a string in C#?

Time: 2016-04-09 18:55:21

Tags: c#

I have a string:

  

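In Boston in 1690, Benjamin Harris published Publick Occurrences Both Forreign and Domestick. This is considered the first newspaper in the American colonies even though only one edition was published before the paper was suppressed by the government. In 1704, the governor allowed The Boston News-Letter to be published and it became the first continuously published newspaper in the colonies. Soon after, weekly papers began publishing in New York and Philadelphia. These early newspapers followed the British format and were usually four pages long. They mostly carried news from Britain and content depended on the editor's interests. In 1783, the Pennsylvania Evening Post became the first American daily.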

I want my program to extract just one sentence from the text above.

For example, if someone types the word `governor` into the TextBox, the output should show:

  

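In 1704, the governor allowed The Boston News-Letter to be published and it became the first continuously published newspaper in the colonies.

I have tried to do this myself, and so far I have written: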

string searchWithinThis = "In Boston in 1690, Benjamin Harris published Publick Occurrences Both Forreign and Domestick. This is considered the first newspaper in the American colonies even though only one edition was published before the paper was suppressed by the government. In 1704, the governor allowed The Boston News-Letter to be published and it became the first continuously published newspaper in the colonies. Soon after, weekly papers began publishing in New York and Philadelphia. These early newspapers followed the British format and were usually four pages long. They mostly carried news from Britain and content depended on the editor's interests. In 1783, the Pennsylvania Evening Post became the first American daily.";
string searchForThis = "governor";
int middle = searchWithinThis.IndexOf(searchForThis);

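My idea is to find the nearest '.' before the word "governor" and the nearest '.' after it, then use Substring to extract the sentence that contains "governor". What I don't know is how to find the index of the '.' on either side of "governor".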

2 Answers:

Answer 0: (score: 2)

Aha, regex to the rescue!

[^\.]*\bgovernor\b[^\.]*

Snippet: https://regex101.com/r/mB7fM7/2

Code:

// Requires: using System; using System.Linq; using System.Text.RegularExpressions;
static void Main(string[] args)
{
    var textToSearch = "governor";
    var textToSearchIn = "In Boston in 1690, Benjamin Harris published Publick Occurrences Both Forreign and Domestick. This is considered the first newspaper in the American colonies even though only one edition was published before the paper was suppressed by the government. In 1704, the governor allowed The Boston News-Letter to be published and it became the first continuously published newspaper in the colonies. Soon after, weekly papers began publishing in New York and Philadelphia. These early newspapers followed the British format and were usually four pages long. They mostly carried news from Britain and content depended on the editor's interests. In 1783, the Pennsylvania Evening Post became the first American daily.";
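    // Regex.Escape keeps special characters in the search word from being read as regex syntax
    var pattern = String.Format("[^\\.]*\\b{0}\\b[^\\.]*", Regex.Escape(textToSearch));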

    if (Regex.IsMatch(textToSearchIn, pattern))
    {
        foreach (var matchedItem in Regex.Matches(textToSearchIn, pattern))
        {
            Console.WriteLine(matchedItem);
            Console.WriteLine();
        }
    }

    // LastOrDefault returns null instead of throwing when nothing matched
    var lastMatch = Regex.Matches(textToSearchIn, pattern).Cast<Match>().LastOrDefault();

    Console.Read();
}

Edit: improved the word matching with \b, and used Regex.MatchCollection to handle multiple matches.
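For reuse, the same pattern could be wrapped in a small helper. This is only a sketch: the class name SentenceRegexSketch and the method FindSentences are made up for illustration, and the case-insensitive matching and the trimming of the leading space are my own assumptions rather than part of the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SentenceRegexSketch
{
    // Hypothetical helper (not from the answer above): returns every
    // sentence that contains `word` as a whole word, without the trailing '.'.
    public static List<string> FindSentences(string text, string word)
    {
        // Regex.Escape keeps characters such as '.' or '+' in the user's
        // input from being interpreted as regex syntax.
        string pattern = string.Format(@"[^.]*\b{0}\b[^.]*", Regex.Escape(word));

        // IgnoreCase is an assumption: it lets "Governor" match "governor".
        return Regex.Matches(text, pattern, RegexOptions.IgnoreCase)
                    .Cast<Match>()
                    .Select(m => m.Value.Trim()) // drop the space left after the previous '.'
                    .ToList();
    }
}
```

Calling SentenceRegexSketch.FindSentences(textToSearchIn, "governor") should give a list with one entry per matching sentence, and an empty list when the word does not occur.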

Answer 1: (score: 1)

One way is to split the string into sentences and then pick the one that contains the word:

var sequence = searchWithinThis.Split('.').FirstOrDefault(s => s.Contains(searchForThis));

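This is not as efficient as IndexOf, though, because it splits the entire text before searching, so it could be a problem if your text is very long.

Otherwise, you could do the following: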

var index = searchWithinThis.IndexOf(searchForThis);

if (index != -1)
{
    int startIndex = 0;
    int endIndex = searchWithinThis.Length;

    for (int i = index + searchForThis.Length; i < searchWithinThis.Length; i++) 
    {
        if (searchWithinThis[i] == '.') 
        {
            endIndex = i;
            break;
        }
    }

    for (int i = index - 1; i >= 0; i--) 
    {
        if (searchWithinThis[i] == '.')
        {
            startIndex = i + 1;
            break;
        }
    }

    var sequence = searchWithinThis.Substring(startIndex, endIndex - startIndex);
}
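
The two loops above can also be written with the LastIndexOf and IndexOf overloads that take a start index. A minimal sketch, assuming sentences end only with '.': the names SentenceIndexSketch and FindSentence are hypothetical, and the case-insensitive search and the final Trim are my additions.

```csharp
using System;

public static class SentenceIndexSketch
{
    // Hypothetical helper: returns the sentence around the first
    // occurrence of `word`, or null when the word is not found.
    public static string FindSentence(string text, string word)
    {
        int index = text.IndexOf(word, StringComparison.OrdinalIgnoreCase);
        if (index == -1)
        {
            return null;
        }

        // LastIndexOf searches backward from `index`. When no '.' precedes
        // the word it returns -1, so adding 1 gives 0 (start of the text).
        int start = text.LastIndexOf('.', index) + 1;

        // IndexOf searches forward from the end of the word.
        int end = text.IndexOf('.', index + word.Length);
        if (end == -1)
        {
            end = text.Length; // no closing '.', take the rest of the text
        }

        return text.Substring(start, end - start).Trim();
    }
}
```

As with the loop version, the returned sentence does not include the closing '.'; use end + 1 (when a '.' was found) if you want to keep it.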