How to use Substring to extract certain text

Time: 2014-06-03 17:55:56

Tags: c#

I have the following C# code that extracts certain text from an HTML source:

string url = txtURL.Text;
string pageCode = WorkerClass.getSourceCode(url);
// Keep everything from the first "</B>" onward.
int startIndex = pageCode.IndexOf("</B>");
pageCode = pageCode.Substring(startIndex);
// The using block closes the writer even if Write throws.
using (StreamWriter sw = new StreamWriter("websitesource.txt"))
{
    sw.Write(pageCode);
}

The code above writes the following to the text file:

</B> WILLIAMS AJAYA L                     <BR>                                                                      
<B>Address : </B> NEW YORK            NY                                          <BR>                                        
<B>Profession : </B> ATHLETIC TRAINER                          <BR>                                                           
<B>License No: </B> 001475 <BR>                                                                                            
<B>Date of Licensure : </B> 01/12/07      <BR>                                                                                
<B>Additional Qualification : </B>     &nbsp; Not applicable in this profession                       <BR>                    
<B> <A href="http://www.op.nysed.gov/help.htm#status"> Status :</A></B> REGISTERED                                        <BR>
<B>Registered through last day of : </B> 08/15      <BR>
<HR><div class ="note">                                                                                                       
* Use of this online verification service signifies that you have read and agree to the                                       
<A href="http://www.op.nysed.gov/usage.htm">terms and conditions of use</A>.   

How can I use code in a for loop to store the text in a string array (trimming any surrounding whitespace)?

So the string array should look like this:

string[] ar = {
"WILLIAMS AJAYA L",
"NEW YORK          NY",
"ATHLETIC TRAINER",
"001475",
"01/12/07",
"Not applicable in this profession",
"REGISTERED",
"08/15"
};

2 Answers:

Answer 0: (score: 3)

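// Requires: using System.IO; using System.Linq;
// For each line, keep the text after the last "</B>", drop "<BR>", and trim.
var lines = File.ReadLines("websitesource.txt")
                .Select(line =>
                    line.Substring(line.LastIndexOf("</B>") + 4)
                        .Replace("<BR>", "")
                        .Trim())
                .ToArray();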
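
Since the question asks for a for loop, the same idea can be written as an explicit loop. The following is only a sketch: the class and method names (`LicenseLineParser`, `ExtractValues`) are hypothetical, and it assumes each value sits on the same line as its closing `</B>` tag, as in the file shown in the question. Unlike the one-liner, it skips lines that contain no `</B>` (such as the trailing `<HR>` and note lines) and also replaces `&nbsp;` with a space.

```csharp
using System;
using System.Collections.Generic;

public static class LicenseLineParser
{
    // Hypothetical helper: returns the trimmed text that follows the last
    // "</B>" on each line, skipping lines that have no "</B>" at all.
    public static string[] ExtractValues(string[] lines)
    {
        const string marker = "</B>";
        var values = new List<string>();

        for (int i = 0; i < lines.Length; i++)
        {
            int idx = lines[i].LastIndexOf(marker, StringComparison.Ordinal);
            if (idx < 0)
            {
                continue; // no label on this line, e.g. the <HR> / note lines
            }

            string value = lines[i].Substring(idx + marker.Length)
                                   .Replace("<BR>", "")
                                   .Replace("&nbsp;", " ")
                                   .Trim();
            values.Add(value);
        }

        return values.ToArray();
    }
}
```

It could be called as `string[] ar = LicenseLineParser.ExtractValues(File.ReadAllLines("websitesource.txt"));`, which needs `using System.IO;`.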

Answer 1: (score: 2)

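I use a string split, which is slightly different from Selman22's approach. I also remove &nbsp; and replace it with a space. In addition, this works no matter where the line breaks fall (since HTML does not require any particular formatting).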

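// FILENAME is the path of the saved source, e.g. "websitesource.txt" from the question.
var split = File.ReadAllText(FILENAME)
                .Replace("<BR>", "").Replace("&nbsp;", " ")
                // Splitting on both tags yields alternating value / label pieces.
                .Split(new[] {"<B>", "</B>"}, StringSplitOptions.RemoveEmptyEntries)
                // For a file that starts at "</B>", the even-indexed pieces are the values.
                .Where((x, i) => i%2 == 0)
                .Select(y => y.Trim()).ToList();

split.ForEach(Console.WriteLine);
Console.ReadKey();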

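The important part here is making sure your data always arrives in this format. HTML can change often, and even a simple change to the DOM can completely break your parsing.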
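
One way to notice such a format change early is to parse the fields by label and fail loudly when an expected label is missing. The sketch below is not part of the original answer. It assumes every field has the shape `<B>label</B> value <BR>`, as in the sample output, and the names `LicenseFieldParser` and `Parse` are hypothetical. Note that the file in the question starts at the first `</B>`, so the name field has no opening `<B>` there and would only be captured from the full page source.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LicenseFieldParser
{
    // Matches "<B>label</B> value <BR>"; the label may itself contain an <A> tag.
    private static readonly Regex FieldPattern = new Regex(
        @"<B>(?<label>.*?)</B>(?<value>.*?)<BR>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns label -> value pairs and throws if a required label is absent.
    public static Dictionary<string, string> Parse(string html, params string[] requiredLabels)
    {
        var fields = new Dictionary<string, string>();

        foreach (Match m in FieldPattern.Matches(html))
        {
            // Strip nested tags, then the trailing colon and whitespace.
            string label = Regex.Replace(m.Groups["label"].Value, "<.*?>", "")
                                .Trim().TrimEnd(':').Trim();
            string value = m.Groups["value"].Value.Replace("&nbsp;", " ").Trim();
            fields[label] = value;
        }

        foreach (string required in requiredLabels)
        {
            if (!fields.ContainsKey(required))
            {
                throw new FormatException("Expected field not found: " + required);
            }
        }

        return fields;
    }
}
```

For example, `LicenseFieldParser.Parse(pageCode, "Profession", "License No", "Status")` would throw a `FormatException` if the page stopped using one of those labels, instead of silently returning shifted values.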

Good luck!