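Splitting an article into sentences using a delimiter

Time: 2017-01-07 12:04:48

Tags: c# c#-4.0

I have a small assignment in which my articles are in a format like this: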

<REUTERS TOPICS="NO" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5545" NEWID="2">
<TITLE>STANDARD OIL &lt;SRD> TO FORM FINANCIAL UNIT</TITLE>
<DATELINE>    CLEVELAND, Feb 26 - </DATELINE><BODY>Standard Oil Co and BP North America
Inc said they plan to form a venture to manage the money market
borrowing and investment activities of both companies.
BP North America is a subsidiary of British Petroleum Co
Plc &lt;BP>, which also owns a 55 pct interest in Standard Oil.
The venture will be called BP/Standard Financial Trading
and will be operated by Standard Oil under the oversight of a
joint management committee.

Reuter
&#3;</BODY></TEXT>
</REUTERS>

I write it to a new XML file with this format:

<article id= some id >
      <subject>articles subject </subject>
      <sentence> sentence #1 </sentence>
      .
      .
      .
      <sentence> sentence #n </sentence>
 </article>

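I wrote code that does all of this, and it works fine.

The problem is that I split the sentences using the delimiter '.', so if the text contains a number such as 2.00, the code treats 2 as one sentence and 00 as a different sentence.

Does anyone know how to identify sentences better, so that numbers stay within the same sentence?

Without having to iterate over the whole array?

Is there a way to make the string.Split() method skip a split when there is a digit both before and after the delimiter?

My code: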

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.IO;
using System.Data; 
using System.Xml;
namespace project
{
    class Program
    {
        static void Main(string[] args)
        {
            string[] lines = System.IO.File.ReadAllLines(@"path");
            string body = "";
            REUTERS article = new REUTERS();
            string sentences = "";
            for (int i = 0; i<lines.Length;i++){
                string line = lines[i];
                // finding the first tag of the article
                if (line.Contains("<REUTERS"))
                {
                    //extracting the id from the tag
                    int Id = line.IndexOf("NEWID=\"") + "NEWID=\"".Length;
                    article.NEWID = line.Substring(Id, line.Length-2 - Id); 
                }
                if (line.Contains("TITLE"))
                {
                    string subject = line;
                    subject = subject.Replace("<TITLE>", "").Replace("</TITLE>", "");

                    article.TITLE = subject;
                }
                if( line.Contains("<BODY"))
                {
                    int startLoc = line.IndexOf("<BODY>") + "<BODY>".Length;
                    sentences = line.Substring(startLoc, line.Length - startLoc);    
                    while (!line.Contains("</BODY>"))
                    {
                        i++;
                        line = lines[i];
                        sentences = sentences +" " + line;
                    }
                    int endLoc = sentences.IndexOf("</BODY>");
                    sentences = sentences.Substring(0, endLoc);
                    char[] delim = {'.'};
                    string[] sentencesSplit = sentences.Split(delim);

                    using (System.IO.StreamWriter file =
                       new System.IO.StreamWriter(@"path",true))
                    {
                        file.WriteLine("<articles>");
                        file.WriteLine("\t <article id = " + article.NEWID + ">");
                        file.WriteLine("\t \t <subject>" + article.TITLE + "</subject>");

                        foreach (string sentence in sentencesSplit)
                        {
                            file.WriteLine("\t \t <sentence>" + sentence + "</sentence>");
                        }
                        file.WriteLine("\t </article>");
                        file.WriteLine("</articles>");
                    }
                }
            }
        }

        public class REUTERS
        {
            public string NEWID;
            public string TITLE;
            public string Body;
        }
    }
}

4 Answers:

Answer 0: (score: 0)

OK, so I found a solution using the ideas I received here. I used an overload of the Split method, like this:

.Split(new string[] { ". " }, StringSplitOptions.None);

It looks much better now.

Answer 1: (score: 0)
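A minimal sketch of that overload in use. The class name, method name, and sample text are made up for illustration; only the `Split(string[], StringSplitOptions)` call comes from the answer (this sketch uses `RemoveEmptyEntries` instead of `None` to drop empty pieces).

```csharp
using System;

public static class SplitDemo
{
    // Splits on a period followed by a space, so "2.00" stays in one piece.
    // Hypothetical helper name; not part of the original code.
    public static string[] SplitOnPeriodSpace(string text)
    {
        return text.Split(new string[] { ". " }, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        // Illustrative sample text, not taken from the Reuters article.
        string text = "The price rose 2.00 dlrs. Analysts expected less. Reuter";
        foreach (string sentence in SplitOnPeriodSpace(text))
        {
            Console.WriteLine(sentence);
        }
    }
}
```

Note that the delimiter itself is consumed, so the sentences lose their trailing period, and an abbreviation followed by a space (for example "Co. ") would still cause a split.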

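You can also use a regular expression that looks for a sentence terminator followed by whitespace:

var pattern = @"(?<=[\.!\?])\s+";
var sentences = Regex.Split(input, pattern);

Note that this works for English; sentences in other languages may follow other rules.

The following example writes each resulting sentence to the output file:

foreach (var sentence in sentences) {
    //do something with the sentence
    var node = string.Format("\t \t <sentence>{0}</sentence>", sentence);
    file.WriteLine(node);
}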
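As a rough, self-contained illustration of how that pattern behaves, here is a sketch. The class name and the sample input are assumptions for demonstration; the pattern is the one from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexSentenceSplitter
{
    // The lookbehind keeps the terminator with its sentence;
    // only the run of whitespace after it acts as the separator.
    private const string Pattern = @"(?<=[\.!\?])\s+";

    public static string[] Split(string input)
    {
        return Regex.Split(input.Trim(), Pattern);
    }

    public static void Main()
    {
        // Illustrative sample text, not taken from the Reuters article.
        string input = "Shares rose 2.00 pct. Was it expected? Yes!";
        foreach (string sentence in Split(input))
        {
            Console.WriteLine(sentence);
        }
    }
}
```

Because the period in "2.00" is followed by a digit rather than whitespace, it never matches, so the number stays inside its sentence. Abbreviations that end in a period and are followed by a space would still be split.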

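Answer 2: (score: 0)

I would build a list of the index points of every '.' character.

For each index point, check the character on each side; if there is a digit on both sides, remove that index point from the list.

Then, when you produce the output, just use the substring function with the remaining index points to pull out each sentence separately.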

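Rough-quality code (it is late). It assumes IndexPoints is a List<int> holding the position of every '.' in the text, and indexesToRemove is a List<int> holding the positions within IndexPoints of the entries that have a digit on both sides.

The next line sorts the removals in descending order, so that removing one entry does not shift the positions of the entries still to be removed in the last step.

indexesToRemove = indexesToRemove.OrderByDescending(i => i).ToList();

Now we simply remove every '.' position that has a digit on both sides.

foreach(int indexPoint in indexesToRemove)
{
    IndexPoints.RemoveAt(indexPoint);
}

Now, when you write the sentences out in the new file format, you just loop over the remaining index points and take each sentence with:

// Substring takes a start and a length, so the length is the distance between the two index points
sentences.Substring(lastIndexPoint + 1, currentIndexPoint - lastIndexPoint)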
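The index-point idea can be sketched end to end as follows. This is one possible reading of the approach, not the answerer's exact code: the class and method names are hypothetical, and for brevity it skips a '.' between two digits while collecting the index points instead of removing it from the list afterwards.

```csharp
using System;
using System.Collections.Generic;

public static class IndexPointSplitter
{
    public static List<string> Split(string text)
    {
        // Collect the position of every '.' that is NOT between two digits.
        var indexPoints = new List<int>();
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] != '.') continue;
            bool digitBefore = i > 0 && char.IsDigit(text[i - 1]);
            bool digitAfter = i + 1 < text.Length && char.IsDigit(text[i + 1]);
            if (!(digitBefore && digitAfter)) indexPoints.Add(i);
        }

        // Cut the text at each remaining index point, keeping the period.
        var result = new List<string>();
        int lastIndexPoint = -1;
        foreach (int currentIndexPoint in indexPoints)
        {
            string sentence = text
                .Substring(lastIndexPoint + 1, currentIndexPoint - lastIndexPoint)
                .Trim();
            if (sentence.Length > 0) result.Add(sentence);
            lastIndexPoint = currentIndexPoint;
        }

        // Keep any trailing text that has no closing period.
        string tail = text.Substring(lastIndexPoint + 1).Trim();
        if (tail.Length > 0) result.Add(tail);
        return result;
    }

    public static void Main()
    {
        // Illustrative sample text, not taken from the Reuters article.
        foreach (string s in Split("It cost 2.00 dlrs. It sold."))
        {
            Console.WriteLine(s);
        }
    }
}
```

This still walks the whole string once, which seems hard to avoid for any approach that has to look at the characters around each period.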

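Answer 3: (score: 0)

I spent a lot of time on this and thought you might like to see it, since it really does not use any awkward code. The output it produces is 99% similar to yours.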

<articles>
    <article id="2">
        <subject>STANDARD OIL &lt;SRD&gt; TO FORM FINANCIAL UNIT</subject>
        <sentence>Standard Oil Co and BP North America</sentence>
        <sentence>Inc said they plan to form a venture to manage the money market</sentence>
        <sentence>borrowing and investment activities of both companies.</sentence>
        <sentence>BP North America is a subsidiary of British Petroleum Co</sentence>
        <sentence>Plc &lt;BP&gt;, which also owns a 55.0 pct interest in Standard Oil.</sentence>
        <sentence>The venture will be called BP/Standard Financial Trading</sentence>
        <sentence>and will be operated by Standard Oil under the oversight of a</sentence>
        <sentence>joint management committee.</sentence>
    </article>
</articles>

The console application is as follows:

using System.Xml;
using System.IO;

namespace ReutersXML
{
    class Program
    {
        static void Main()
        {
            XmlDocument xmlDoc = new XmlDocument();

            xmlDoc.Load("reuters.xml");

            var reuters = xmlDoc.GetElementsByTagName("REUTERS");
            var article = reuters[0].Attributes.GetNamedItem("NEWID").Value;
            var subject = xmlDoc.GetElementsByTagName("TITLE")[0].InnerText;
            var body = xmlDoc.GetElementsByTagName("BODY")[0].InnerText;

            string[] sentences = body.Split(new string[] { System.Environment.NewLine },
                System.StringSplitOptions.RemoveEmptyEntries);

            using (FileStream fileStream = new FileStream("reuters_new.xml", FileMode.Create))
            using (StreamWriter sw = new StreamWriter(fileStream))
            using (XmlTextWriter xmlWriter = new XmlTextWriter(sw))
            {
                xmlWriter.Formatting = Formatting.Indented;
                xmlWriter.Indentation = 4;

                xmlWriter.WriteStartElement("articles");
                xmlWriter.WriteStartElement("article");
                xmlWriter.WriteAttributeString("id", article);
                xmlWriter.WriteElementString("subject", subject);

                foreach (var s in sentences)
                    if (s.Length > 10)
                        xmlWriter.WriteElementString("sentence", s);

                xmlWriter.WriteEndElement();
            }
        }
    }
}

I hope you like it :).
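For comparison, a small sketch that lets an XML writer handle the escaping while splitting the body into sentences rather than lines. It is an assumption-laden variation and not the answerer's program: the class name, the Build method, and the choice of the whitespace-after-terminator regex are all introduced here for illustration, and it writes to a string instead of a file.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;
using System.Xml;

public static class ArticleXmlSketch
{
    // Hypothetical helper: builds the <articles> document for one article.
    public static string Build(string id, string subject, string body)
    {
        // Join the wrapped lines into one line, then split after ., ! or ?
        string flat = Regex.Replace(body, @"\s+", " ").Trim();
        string[] sentences = Regex.Split(flat, @"(?<=[\.!\?])\s+");

        var settings = new XmlWriterSettings { Indent = true, OmitXmlDeclaration = true };
        var output = new StringWriter();
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            writer.WriteStartElement("articles");
            writer.WriteStartElement("article");
            writer.WriteAttributeString("id", id);
            writer.WriteElementString("subject", subject);
            foreach (string sentence in sentences)
            {
                // The writer escapes characters such as '<' and '>' itself.
                if (sentence.Length > 0) writer.WriteElementString("sentence", sentence);
            }
            writer.WriteEndElement(); // </article>
            writer.WriteEndElement(); // </articles>
        }
        return output.ToString();
    }

    public static void Main()
    {
        // Illustrative sample text loosely modelled on the article above.
        Console.WriteLine(Build("2", "OIL <SRD>",
            "Co Plc <BP>, owns 55.0 pct.\nThe venture\nwill run."));
    }
}
```

Writing through XmlWriter avoids hand-building tags with string concatenation, so text such as "<BP>" cannot break the output document.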