Using a regular expression to find custom HTML tags in a string and return the text and image URLs as separate array elements

Asked: 2015-03-23 07:12:41

Tags: c# arrays regex image

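I have a requirement to extract URLs from a body of text. I want to remove the tags from the original body and then add the text segments and the URLs to an array, in the order they appear.

For example, I want to convert this string: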

Muffin powder chocolate candy jelly icing cotton candy. Oat cake
danish bear claw tootsie roll donut pie. Toffee chupa chups brownie
cupcake pudding sweet roll dessert jelly-o. <blobUrl=https://google.com/img1.png> Gummies macaroon pudding
marzipan. Chocolate cake biscuit muffin tart jelly-o carrot cake.
Liquorice dessert gummi bears icing danish. Ice cream marshmallow
candy marzipan cupcake. Sweet lollipop dragée chocolate cheesecake
chocolate gummies sesame snaps. <blobUrl=https://google.com/img with space.png> Lollipop jelly bear claw danish jelly
beans chocolate. Pudding cake gingerbread dessert halvah jelly
marzipan. Gingerbread oat cake dragée cake cake marzipan. Oat cake
lemon drops pudding bear claw soufflé lollipop biscuit pudding.

into an array like this:

arrayVariable[0] = "Muffin powder chocolate candy jelly icing cotton candy. Oat cake
danish bear claw tootsie roll donut pie. Toffee chupa chups brownie
cupcake pudding sweet roll dessert jelly-o."

arrayVariable[1] = "https://google.com/img1.png"

arrayVariable[2] = "Gummies macaroon pudding
marzipan. Chocolate cake biscuit muffin tart jelly-o carrot cake.
Liquorice dessert gummi bears icing danish. Ice cream marshmallow
candy marzipan cupcake. Sweet lollipop dragée chocolate cheesecake
chocolate gummies sesame snaps."

arrayVariable[3] = "https://google.com/img with space.png"

arrayVariable[4] = "Lollipop jelly bear claw danish jelly
beans chocolate. Pudding cake gingerbread dessert halvah jelly
marzipan. Gingerbread oat cake dragée cake cake marzipan. Oat cake
lemon drops pudding bear claw soufflé lollipop biscuit pudding."

So far I have tried this regular expression:

    var bodyToParse = bodyText;

    string re1 = ".*?"; // Non-greedy match on filler
    string re2 = "((?:http|https)(?::\\/{2}[\\w]+)(?:[\\/|\\.]?)(?:[^\\s\"]*))";    // HTTP URL 1

    Regex r = new Regex(re1 + re2, RegexOptions.IgnoreCase | RegexOptions.Singleline);
    Match m = r.Match(bodyToParse);
    if (m.Success)
    {
        String httpurl1 = m.Groups[1].ToString();
        Debug.WriteLine("(" + httpurl1.ToString() + ")" + "\n");
    }

This works well for finding a URL, but I can't figure out how to split everything into a list of strings.
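Two details of the attempt above are worth noting: Regex.Match returns only the first match, and the trailing [^\s"]* stops at the first whitespace character (so "img with space.png" would be cut short) while still accepting the closing ">". A minimal sketch of listing every URL instead, assuming the tags always have the <blobUrl=...> form shown in the example; the class and method names (BlobUrlFinder, FindUrls) are hypothetical and used only for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BlobUrlFinder
{
    // Hypothetical helper: returns every URL found inside a <blobUrl=...> tag.
    // Assumes a URL never contains '>' (the tag terminator).
    public static List<string> FindUrls(string body)
    {
        var urls = new List<string>();
        // [^>]+ runs up to the closing '>', so spaces inside the URL are kept.
        foreach (Match m in Regex.Matches(body, @"<blobUrl=(https?://[^>]+)>", RegexOptions.IgnoreCase))
        {
            urls.Add(m.Groups[1].Value);
        }
        return urls;
    }
}
```

This only collects the URLs; it does not yet interleave them with the surrounding text.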

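1 Answer:

Answer 0 (score: 1)

You can use this snippet. It may not be the best approach, but it should capture all the URLs and collect them, together with all the parts of the text, into a List<string> variable named lst: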

var tststr = @"Muffin powder chocolate candy jelly icing cotton candy. Oat cake
danish bear claw tootsie roll donut pie. Toffee chupa chups brownie
cupcake pudding sweet roll dessert jelly-o. <blobUrl=https://google.com/img1.png> Gummies macaroon pudding
marzipan. Chocolate cake biscuit muffin tart jelly-o carrot cake.
Liquorice dessert gummi bears icing danish. Ice cream marshmallow
candy marzipan cupcake. Sweet lollipop dragée chocolate cheesecake
chocolate gummies sesame snaps. <blobUrl=https://google.com/img with space.png> Lollipop jelly bear claw danish jelly
beans chocolate. Pudding cake gingerbread dessert halvah jelly
marzipan. Gingerbread oat cake dragée cake cake marzipan. Oat cake
lemon drops pudding bear claw soufflé lollipop biscuit pudding.";
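// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
var lst = new List<string>();
var former_idx = 0; // index just past the previous tag
// Each match covers one <blobUrl=...> tag plus any surrounding whitespace.
for (var m = Regex.Match(tststr, @"\s*<blobUrl=(http[^>]+)>\s*"); m.Success; m = m.NextMatch())
{
    // Text between the previous tag (or the start of the string) and this tag.
    lst.Add(tststr.Substring(former_idx, m.Index - former_idx));
    // The URL captured by group 1.
    lst.Add(m.Groups[1].Value);
    former_idx = m.Index + m.Length;
}
// Whatever text remains after the last tag.
if (former_idx < tststr.Length)
    lst.Add(tststr.Substring(former_idx));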
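
As a possible alternative to the loop above, Regex.Split can produce the same interleaved list in one call, because .NET includes the text of capturing groups in the array that Split returns. A minimal sketch using the same pattern; the class and method names (BlobUrlSplitter, SplitBody) are hypothetical, and dropping empty segments is an added assumption about what is wanted when a tag sits at the very start or end of the body:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class BlobUrlSplitter
{
    // Hypothetical helper: splits the body on <blobUrl=...> tags and returns
    // text segments and URLs in their original order.
    public static List<string> SplitBody(string body)
    {
        // The capturing group (http[^>]+) makes Regex.Split emit each URL
        // between the text segments it separates.
        return Regex.Split(body, @"\s*<blobUrl=(http[^>]+)>\s*")
                    .Where(part => part.Length > 0) // assumption: skip empty segments
                    .ToList();
    }
}
```

Calling BlobUrlSplitter.SplitBody(tststr) on the sample string should give five entries: text, URL, text, URL, text.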