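I'm trying to remove the HTML tags from a string that contains ">" characters between the tags. I tried using a regular expression, but it removes pieces of the string that I want to keep.
Here is an example of the string I'm trying to strip all HTML tags from: http://pastebin.com/0aqn12Gh
As you can see, the user-written comments contain ">" characters.
Does anyone know how to do this with a regular expression? I'm doing this in VB.net; if there is a better approach than regex, I'm open to recommendations.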
Answer 0 (score: 1)
It looks like you are trying to parse some JSON data, so I suggest first deserializing the JSON into actual objects.
Public Class User
    Private mPosts As Post()

    Property posts() As Post()
        Get
            Return mPosts
        End Get
        ' Property setters must take their value ByVal, not ByRef
        Set(ByVal Value As Post())
            mPosts = Value
        End Set
    End Property
End Class

Public Class Post
    Private mNo As Integer
    Private mNow As String
    Private mName As String
    Private mCom As String
    Private mFilename As String
    Private mExt As String
    Private mW As Integer
    Private mH As Integer
    Private mTn_w As Integer
    Private mTn_h As Integer
    Private mTim As ULong
    Private mTime As Integer
    Private mMd5 As String
    Private mFsize As Integer
    Private mResto As Integer
    Private mBumplimit As Integer
    Private mImagelimit As Integer
    Private mReplies As Integer
    Private mImages As Integer

    Property no() As Integer
        Get
            Return mNo
        End Get
        Set(ByVal Value As Integer)
            mNo = Value
        End Set
    End Property

    Property now() As String
        Get
            Return mNow
        End Get
        Set(ByVal Value As String)
            mNow = Value
        End Set
    End Property

    ' Et cetera: one property per remaining field (name, com, filename, ...)
End Class
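Once you have received the JSON:
' Requires: Imports System.Web.Script.Serialization (System.Web.Extensions assembly)
Dim serializer As New JavaScriptSerializer()
Dim userData = serializer.Deserialize(Of User)(jsonText)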
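Then clean the "com" property of each deserialized post, since that is where the HTML lives:
' Requires: Imports System.Text.RegularExpressions
Dim cleanText = Regex.Replace(userData.posts(k).com, "<([^>]+)>", "")
' posts(k) assumes you are iterating through the posts array with an index named k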
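Disclaimer: VB.Net is not a language I use regularly, so the code above may contain syntax or style errors. I have also never used JavaScriptSerializer before; my use of it here is based solely on reading the documentation.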
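The steps in this answer can be put together in one small console program. This is only a sketch under stated assumptions: it targets .NET Framework with a reference to System.Web.Extensions (the assembly that contains JavaScriptSerializer), the classes are trimmed to the two fields used here, the sample JSON is invented rather than taken from the pastebin link, and the names SketchUser, SketchPost and CleanComments are hypothetical.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions
Imports System.Web.Script.Serialization

' Trimmed-down stand-ins for the User/Post classes above (hypothetical names).
Public Class SketchPost
    Public Property no As Integer
    Public Property com As String
End Class

Public Class SketchUser
    Public Property posts As SketchPost()
End Class

Public Module JsonCleanSketch
    ' Deserializes the JSON text and returns each post's "com" field
    ' with everything that looks like <...> removed.
    Public Function CleanComments(jsonText As String) As List(Of String)
        Dim serializer As New JavaScriptSerializer()
        Dim userData As SketchUser = serializer.Deserialize(Of SketchUser)(jsonText)
        Dim cleaned As New List(Of String)()
        For Each p As SketchPost In userData.posts
            ' Posts without a comment deserialize with com = Nothing.
            If p.com IsNot Nothing Then
                cleaned.Add(Regex.Replace(p.com, "<([^>]+)>", ""))
            End If
        Next
        Return cleaned
    End Function

    Public Sub Main()
        ' Invented sample data, used only for illustration.
        Dim json As String = "{""posts"":[{""no"":1,""com"":""<b>hi</b> there""}]}"
        For Each cleanedLine As String In CleanComments(json)
            Console.WriteLine(cleanedLine)
        Next
    End Sub
End Module
```

Deserializing first means the regex only ever sees the comment text, never the JSON punctuation around it.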
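Answer 1 (score: 1)
If you only want to remove the surrounding tag pair (it looks like a span) without removing the other tags, this might be another way to do it.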
Find:
 # <[^<>]+>((?:(?:(?!<[^<>]+>|>).)*>(?:(?!<[^<>]+>|>).)*)+)<[^<>]+>
 < [^<>]+ >
 (
      (?:
           (?:
                (?! < [^<>]+ > | > )
                .
           )*
           >
           (?:
                (?! < [^<>]+ > | > )
                .
           )*
      )+
 )
 < [^<>]+ >
Replace: $1
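As a rough illustration of the find/replace pair above in VB.NET, here is a minimal sketch. The module name UnwrapSketch, the function name UnwrapQuotedPairs and the sample string are made up for this example. Note that the pattern does not check that the two tags it removes actually belong together; it only requires a ">" somewhere in the text between them.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Public Module UnwrapSketch
    ' The one-line form of the pattern from this answer.
    Private Const Pattern As String = "<[^<>]+>((?:(?:(?!<[^<>]+>|>).)*>(?:(?!<[^<>]+>|>).)*)+)<[^<>]+>"

    ' Drops the two tags around a stretch of text only when that text
    ' contains a ">", and keeps the text itself (capture group 1).
    Public Function UnwrapQuotedPairs(html As String) As String
        Return Regex.Replace(html, Pattern, "$1")
    End Function

    Public Sub Main()
        ' Invented sample: the <b> pair has no ">" inside it, so it is left alone.
        Dim sample As String = "<b>bold</b> <span class=""quote"">>implying</span>"
        Console.WriteLine(UnwrapQuotedPairs(sample))
    End Sub
End Module
```

For the sample above, the span pair is removed and the text ">implying" is kept, while the b tags stay in place.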
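Answer 2 (score: 0)
I think you are worried that the comments may contain > and < characters that are not part of a tag, and you want to keep those characters. The following regex should do that: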
<[^<>]*>
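I don't know VB.net myself, but from a quick Google search I'd say you need:
Dim output As String = Regex.Replace(input, "<[^<>]*>", "")
Working Example on RegExr - you can switch it to replace mode and leave the replacement string empty to see the output.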
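A small self-contained VB.NET sketch of this approach follows. The module name StripTagsSketch, the function name StripTags and the sample string are invented for illustration. Because the character class excludes both "<" and ">", a lone ">" in the text is never treated as part of a tag.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Public Module StripTagsSketch
    ' Removes every <...> run that has no other "<" or ">" inside it.
    Public Function StripTags(html As String) As String
        Return Regex.Replace(html, "<[^<>]*>", "")
    End Function

    Public Sub Main()
        ' Invented sample with stray ">" characters outside of any tag.
        Dim sample As String = "<span class=""quote"">>implying</span><br>2 > 1"
        Console.WriteLine(StripTags(sample))
    End Sub
End Module
```

One caveat of this sketch: text such as "a < b and c > d" would still be matched as if "< b and c >" were a tag, so it is only safe when a bare "<" is never followed later by a bare ">".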