Question

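When I fetch data from a feed and extract the content with regular expressions, I am still left with character entities (&o#8230;, &o#8211;, &o#8220, etc.). [I added an "o" to the first two so that they would not be rendered in this post.] Sadly, these are also present in the source of the feed content. Is there a regex for this? I tried a few myself, without success: &#[0-9]{4};

My code: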

protected override void OnNavigatedTo(System.Windows.Navigation.NavigationEventArgs e)
    {

        try
        {        

            SyndicationItem sItem = IsolatedStorageSettings.ApplicationSettings["postovi"] as SyndicationItem; // the item the user chose to display
            List<string> CC_List = IsolatedStorageSettings.ApplicationSettings["ContentList"] as List<string>; // titles and contents pulled from the feed and stored in a list

            PageTitle.Text = sItem.Title.Text; 
            PageTitle.FontSize = 40;

            foreach (var item in CC_List)
            {
                int i;

                if (item == PageTitle.Text)
                {
                    i = CC_List.IndexOf(item, 0); // index of the title in the list
                    String content = CC_List[i + 1];
                    content = Regex.Replace(content, @"(?<startTag><\s*script[^>]*>)(?<content>[\s\S]*?)(?<endTag><\s*/script[^>]*>)", string.Empty);
                    Match link = Regex.Match(content, @"(?<=<img\s+[^>]*?src=(?<q>['""]))(?<url>.+?)(?=\k<q>)", RegexOptions.Singleline);
                    content = Regex.Replace(content, @"(?></?\w+)(?>(?:[^>'""]+|'[^']*'|""[^""]*"")*)>", string.Empty);
                    content = Regex.Replace(content, "&nbsp;", string.Empty);
                    Uri uri = new Uri(link.Value);
                    slika_clanak.Source = ImageFromUri(link.Value); // gets image
                    content = Regex.Replace(content, @"<p>.*</p>", string.Empty);

                    clanak_textblock.Text = content.Trim(); // reads article text and puts it on screen

                }

            }
        }
        catch (Exception)
        {
            // Exception handling is not shown in the original post.
        }
    }

Answer 1

Have you tried the HttpUtility.HtmlDecode method? It is standard in the System.Net assembly, though I can't say for certain whether it is also available on WP7.
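
A minimal sketch of the decoding approach suggested above. It assumes a desktop .NET runtime and uses System.Net.WebUtility.HtmlDecode so that it runs as a console program; on Windows Phone 7 the equivalent call is expected to be System.Net.HttpUtility.HtmlDecode, but that has not been verified here. The helper name DecodeEntities and the sample string are made up for illustration.

```csharp
using System;
using System.Net;

public static class EntityDecodingSketch
{
    // Hypothetical helper: converts numeric entities (&#8230;) and named
    // entities (&amp;) into the characters they stand for.
    public static string DecodeEntities(string text)
    {
        return WebUtility.HtmlDecode(text);
    }

    public static void Main()
    {
        // Sample input invented for this sketch, using the entities from the question.
        string raw = "Wait&#8230; 2010&#8211;2012 &#8220;quoted&#8221; &amp; more";

        // Prints: Wait… 2010–2012 “quoted” & more
        Console.WriteLine(DecodeEntities(raw));
    }
}
```

In the question's code this call would most naturally go after the tags have been stripped and before the text is assigned to clanak_textblock.Text, so that decoded "<" or ">" characters are not mistaken for markup by the tag-stripping regex.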

Answer 2

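Despite my comment, I realized that a second option might be the Html Agility Pack, for which WP7.5 binaries are available. You may run into the issue that was posted on SO and answered in this thread, http://htmlagilitypack.codeplex.com/discussions/282469, about including certain libraries so that it compiles. The reason I mention it is that it has a very robust HtmlEntity class, which builds a dictionary of all the entities. You may not be able to use DeEntitize() directly, but if need be you can study it to see how to build something that strips everything out.

Personally, I would rather not hand-craft regexes; I would use something like this that has already been built for me and then iterate over everything I consider relevant. Of course, this is a phone, so you might be better off stripping on a case-by-case basis, but that gets difficult if the feed keeps changing and you don't have enough sample data to build from.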
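
For the case-by-case route, a rough sketch of a hand-built replacer might look like the following. It is not how the Html Agility Pack implements DeEntitize(); the class name, the tiny named-entity table, and the choice to map &nbsp; to a plain space are all assumptions made for illustration. Note that the pattern from the question, &#[0-9]{4};, only matches four-digit codes, while numeric entities vary in length, so this sketch accepts one to seven digits.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EntityStripperSketch
{
    // Illustrative subset only; a complete entity table is far larger.
    private static readonly Dictionary<string, string> Named =
        new Dictionary<string, string>
        {
            { "amp", "&" }, { "lt", "<" }, { "gt", ">" },
            { "quot", "\"" }, { "nbsp", " " } // plain space chosen for display text
        };

    // Matches decimal numeric entities (&#8230;) and named entities (&amp;).
    private static readonly Regex Entity = new Regex(
        @"&(?:#(?<num>[0-9]{1,7})|(?<name>[A-Za-z][A-Za-z0-9]*));");

    public static string Replace(string text)
    {
        return Entity.Replace(text, m =>
        {
            if (m.Groups["num"].Success)
            {
                int code = int.Parse(m.Groups["num"].Value);
                // Reject values that are not valid Unicode scalar values.
                bool valid = code > 0 && code <= 0x10FFFF
                             && (code < 0xD800 || code > 0xDFFF);
                return valid ? char.ConvertFromUtf32(code) : m.Value;
            }

            string replacement;
            // Unknown names are left untouched rather than guessed at.
            return Named.TryGetValue(m.Groups["name"].Value, out replacement)
                ? replacement
                : m.Value;
        });
    }
}
```

Calling EntityStripperSketch.Replace(content) would turn "&#8211;" into an en dash and leave anything it does not recognize as it was, which keeps the behavior predictable when the feed introduces entities the table does not cover.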

How to remove character entity numbers from text
