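Here is a link to the string I am trying to work with: http://pastebin.com/raw/TRKbqGxs
Yes, I know regex is not the best way to parse HTML, but I would like to use it for this project. I don't want to use HTML Agility Pack right now.
The lines I am mainly interested in are these: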
data-screen-name="thanhbach195" data-name="Mai Thanh Bách" data-protected="false">
data-screen-name="zeref980" data-name="Yan Naung Htet" data-protected="false">
I want to extract the data between the text data-screen-name=" and the text " data-name=
In this case, that is thanhbach195 and zeref980.
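I tried the following regex: string reg = "data-screen-name=\"(.*)\" data-name=\"";
But for some reason I am not getting multiple matches. In fact, it does not seem to extract the strings I want at all.
I would appreciate it if someone could help me write a correct regex that extracts every match between those two strings in the text linked above.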
private List<string> getUsers(string str)
{
List<string> users = new List<string>();
string reg = "data-screen-name=\"(.*)\" data-name=\"";
MatchCollection mc = Regex.Matches(str, reg);
foreach(Match m in mc)
{
users.Add(m.Groups[1].Value);
}
return users;
}
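This code returns the same match every time (I believe it is the first one).
Answer 0 (score: 0)
Try a regex with named groups that stops at the next double quote instead of matching greedily:
string reg = "data-screen-name=\"(?<dataScreenName>[^\"]+)\" data-name=\"(?<dataName>[^\"]+)\"";
This lets you use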
foreach(Match m in mc)
{
users.Add(m.Groups["dataScreenName"].Value);
}
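For reference, here is a minimal, self-contained sketch of the question's method with this pattern applied. The class name UserExtractor and the sample input are made up for illustration, and the pattern assumes the attributes are always double-quoted and appear in this order, separated by a single space, as in the lines quoted in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UserExtractor
{
    // [^"]+ cannot run past a closing quote, so each match stays inside one attribute value.
    private static readonly Regex UserPattern = new Regex(
        "data-screen-name=\"(?<dataScreenName>[^\"]+)\" data-name=\"(?<dataName>[^\"]+)\"");

    public static List<string> GetUsers(string html)
    {
        var users = new List<string>();
        foreach (Match m in UserPattern.Matches(html))
        {
            users.Add(m.Groups["dataScreenName"].Value);
        }
        return users;
    }

    public static void Main()
    {
        // Hypothetical input modeled on the two lines quoted in the question.
        string sample =
            "data-screen-name=\"thanhbach195\" data-name=\"Mai Thanh Bách\" data-protected=\"false\">\n" +
            "data-screen-name=\"zeref980\" data-name=\"Yan Naung Htet\" data-protected=\"false\">";

        foreach (string user in GetUsers(sample))
        {
            Console.WriteLine(user);
        }
    }
}
```

By contrast, when several tags sit on one line, the greedy (.*) from the question runs on to the last " data-name=" on that line, so the whole line collapses into a single long match.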
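Answer 1 (score: 0)
You really should use an HTML parsing engine, because there are many obscure edge cases that a regex cannot accommodate. But I am not your mom, so I won't tell you how to live your life.
Since you appear to have control over the source text, you may be able to avoid the obscure edge cases entirely.
This regex will do the following:
- find div tags
- require that the div tag contains class="user-actions
- capture the values of data-screen-name, data-name, and data-protected, with or without surrounding quotes, into their own capture groups
The regex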
<div\b(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?class=['"]?user-actions)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?data-screen-name=(['"]?)(.*?)\1(?:\s|>))(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?data-name=(['"]?)(.*?)\3(?:\s|>))(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?data-protected=(['"]?)(.*?)\5(?:\s|>))(?:[^>=]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*\s*>
I recommend using the case-insensitive flag.
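In C#, the case-insensitive flag is RegexOptions.IgnoreCase. The following is a rough sketch of how the expression might be wired up; the class name UserActionsParser, the SkipAttributes constant, and the sample input are assumptions made for illustration. The expression is split into pieces only to keep the C# quoting manageable, and the pieces concatenate to the same pattern.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UserActionsParser
{
    // The lazy "skip over attributes" fragment that each lookahead starts with.
    private const string SkipAttributes =
        @"(?:[^>=]|='[^']*'|=""[^""]*""|=[^'""][^\s>]*)*?";

    private static readonly Regex DivPattern = new Regex(
        @"<div\b"
        + "(?=" + SkipAttributes + @"class=['""]?user-actions)"
        + "(?=" + SkipAttributes + @"data-screen-name=(['""]?)(.*?)\1(?:\s|>))"
        + "(?=" + SkipAttributes + @"data-name=(['""]?)(.*?)\3(?:\s|>))"
        + "(?=" + SkipAttributes + @"data-protected=(['""]?)(.*?)\5(?:\s|>))"
        + @"(?:[^>=]|='(?:[^']|\\')*'|=""(?:[^""]|\\"")*""|=[^'""][^\s>]*)*\s*>",
        RegexOptions.IgnoreCase);

    // Returns one { screen name, name, protected } triple per matching div tag.
    public static List<string[]> Parse(string html)
    {
        var rows = new List<string[]>();
        foreach (Match m in DivPattern.Matches(html))
        {
            // Groups 1, 3 and 5 hold the quote characters; 2, 4 and 6 hold the values.
            rows.Add(new[] { m.Groups[2].Value, m.Groups[4].Value, m.Groups[6].Value });
        }
        return rows;
    }

    public static void Main()
    {
        // Hypothetical tag; the upper-case tag name is there to exercise IgnoreCase.
        string sample =
            "<DIV class=\"user-actions btn-group\" data-user-id=\"1\"\n" +
            "data-screen-name=\"someuser\" data-name=\"Some User\" data-protected=\"false\">";

        foreach (string[] row in Parse(sample))
        {
            Console.WriteLine(string.Join(" | ", row));
        }
    }
}
```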
Excerpt from the source text
A small portion of the sample text
<div class="user-actions btn-group not-following not-muting protected" data-user-id="726459723365502976"
data-screen-name="Just__Kidding__" data-name="Chaw Chin Fong" data-protected="true">
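Live example
This shows a large portion of the file rather than all of it, because the online tool bogs down on huge strings of text.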
https://regex101.com/r/bY1kH8/1
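Capture groups
0. the entire matched div tag
1. the quote character around the data-screen-name value
2. the data-screen-name value, not including any quotes
3. the quote character around the data-name value
4. the data-name value, not including any quotes
5. the quote character around the data-protected value
6. the data-protected value, not including any quotes
Sample matches
These were pulled from the source text using the proposed regex.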
[0][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="2582252852"
data-screen-name="w33haa" data-name="Aliwi Omar" data-protected="false">
[0][1] = "
[0][2] = w33haa
[0][3] = "
[0][4] = Aliwi Omar
[0][5] = "
[0][6] = false
[1][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="1680222842"
data-screen-name="Jamjomon" data-name="Jamchu :3" data-protected="false">
[1][1] = "
[1][2] = Jamjomon
[1][3] = "
[1][4] = Jamchu :3
[1][5] = "
[1][6] = false
[2][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="1523823648"
data-screen-name="dimakoza4enko" data-name="Дима Козаченко" data-protected="false">
[2][1] = "
[2][2] = dimakoza4enko
[2][3] = "
[2][4] = Дима Козаченко
[2][5] = "
[2][6] = false
[3][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="1522238240"
data-screen-name="alupulipulipala" data-name="Wahid Arefin" data-protected="false">
[3][1] = "
[3][2] = alupulipulipala
[3][3] = "
[3][4] = Wahid Arefin
[3][5] = "
[3][6] = false
[4][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="4804204573"
data-screen-name="thanhbach195" data-name="Mai Thanh Bách" data-protected="false">
[4][1] = "
[4][2] = thanhbach195
[4][3] = "
[4][4] = Mai Thanh Bách
[4][5] = "
[4][6] = false
[5][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="726465523223908353"
data-screen-name="zeref980" data-name="Yan Naung Htet" data-protected="false">
[5][1] = "
[5][2] = zeref980
[5][3] = "
[5][4] = Yan Naung Htet
[5][5] = "
[5][6] = false
[6][0] = <div class="user-actions btn-group not-following not-muting protected" data-user-id="726459723365502976"
data-screen-name="Just__Kidding__" data-name="Chaw Chin Fong" data-protected="true">
[6][1] = "
[6][2] = Just__Kidding__
[6][3] = "
[6][4] = Chaw Chin Fong
[6][5] = "
[6][6] = true
[7][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="713605605638938624"
data-screen-name="Fruitcentre" data-name="Fruit & Veg Centre" data-protected="false">
[7][1] = "
[7][2] = Fruitcentre
[7][3] = "
[7][4] = Fruit & Veg Centre
[7][5] = "
[7][6] = false
[8][0] = <div class="user-actions btn-group not-following not-muting protected" data-user-id="555968644"
data-screen-name="aeronhalecastle" data-name="Eywon ツ" data-protected="true">
[8][1] = "
[8][2] = aeronhalecastle
[8][3] = "
[8][4] = Eywon ツ
[8][5] = "
[8][6] = true
[9][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="2845398050"
data-screen-name="Deheyb" data-name="4k Scrub✌️" data-protected="false">
[9][1] = "
[9][2] = Deheyb
[9][3] = "
[9][4] = 4k Scrub✌️
[9][5] = "
[9][6] = false
[10][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="721815663216566272"
data-screen-name="Ribbon2712" data-name="Даниил Демидов" data-protected="false">
[10][1] = "
[10][2] = Ribbon2712
[10][3] = "
[10][4] = Даниил Демидов
[10][5] = "
[10][6] = false
[11][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="3248438456"
data-screen-name="zayarmgmg95" data-name="Zayar Mg" data-protected="false">
[11][1] = "
[11][2] = zayarmgmg95
[11][3] = "
[11][4] = Zayar Mg
[11][5] = "
[11][6] = false
[12][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="726440286063198208"
data-screen-name="Ninderpy" data-name="Derpy" data-protected="false">
[12][1] = "
[12][2] = Ninderpy
[12][3] = "
[12][4] = Derpy
[12][5] = "
[12][6] = false
[13][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="423763655"
data-screen-name="ImJoehuff" data-name="JoeyT" data-protected="false">
[13][1] = "
[13][2] = ImJoehuff
[13][3] = "
[13][4] = JoeyT
[13][5] = "
[13][6] = false
[14][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="726441786839703556"
data-screen-name="zxmir_" data-name="Zxmir_" data-protected="false">
[14][1] = "
[14][2] = zxmir_
[14][3] = "
[14][4] = Zxmir_
[14][5] = "
[14][6] = false
[15][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="726440845713367041"
data-screen-name="hienlequang" data-name="Hiền Lê Quang" data-protected="false">
[15][1] = "
[15][2] = hienlequang
[15][3] = "
[15][4] = Hiền Lê Quang
[15][5] = "
[15][6] = false
[16][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="3032113115"
data-screen-name="Najer14" data-name="Jan" data-protected="false">
[16][1] = "
[16][2] = Najer14
[16][3] = "
[16][4] = Jan
[16][5] = "
[16][6] = false
[17][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="4762819022"
data-screen-name="7forOne" data-name="Abiel" data-protected="false">
[17][1] = "
[17][2] = 7forOne
[17][3] = "
[17][4] = Abiel
[17][5] = "
[17][6] = false
[18][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="717061680799330306"
data-screen-name="Th3uN1qu31" data-name="Th3_uN1Qu3" data-protected="false">
[18][1] = "
[18][2] = Th3uN1qu31
[18][3] = "
[18][4] = Th3_uN1Qu3
[18][5] = "
[18][6] = false
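The quote groups in the listing above ([n][1], [n][3], [n][5]) exist because the expression accepts a value in double quotes, single quotes, or no quotes, and uses a backreference to require the same closing quote. Here is a small sketch of that one piece in isolation; the class name QuoteBackreferenceDemo and the sample tags are invented for illustration, and only the data-screen-name fragment of the expression is used.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteBackreferenceDemo
{
    // Group 1 captures the opening quote (possibly empty); \1 demands the same
    // character again before the whitespace or '>' that ends the attribute.
    private static readonly Regex ScreenName = new Regex(
        @"data-screen-name=(['""]?)(.*?)\1(?:\s|>)");

    // Returns the attribute value without quotes, or null when nothing matches.
    public static string Extract(string tag)
    {
        Match m = ScreenName.Match(tag);
        return m.Success ? m.Groups[2].Value : null;
    }

    public static void Main()
    {
        // Hypothetical tags showing the three quoting styles.
        string[] samples =
        {
            "<div data-screen-name=\"w33haa\" data-name=\"Aliwi Omar\">",
            "<div data-screen-name='w33haa' data-name='Aliwi Omar'>",
            "<div data-screen-name=w33haa>"
        };

        foreach (string tag in samples)
        {
            Console.WriteLine(Extract(tag));
        }
    }
}
```

All three sample tags yield w33haa, since the lazy (.*?) stops at the first position where the matching quote is followed by whitespace or '>'.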
NODE EXPLANATION
----------------------------------------------------------------------
<div '<div'
----------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the least amount
possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=' '=\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=" '="'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
[^'"] any character except: ''', '"'
----------------------------------------------------------------------
[^\s>]* any character except: whitespace (\n,
\r, \t, \f, and " "), '>' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
)*? end of grouping
----------------------------------------------------------------------
class= 'class='
----------------------------------------------------------------------
['"]? any character of: ''', '"' (optional
(matching the most amount possible))
----------------------------------------------------------------------
user-actions 'user-actions'
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the least amount
possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=' '=\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=" '="'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
[^'"] any character except: ''', '"'
----------------------------------------------------------------------
[^\s>]* any character except: whitespace (\n,
\r, \t, \f, and " "), '>' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
)*? end of grouping
----------------------------------------------------------------------
data-screen-name= 'data-screen-name='
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
['"]? any character of: ''', '"' (optional
(matching the most amount possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
.*? any character (0 or more times
(matching the least amount possible))
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
\1 what was matched by capture \1
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
> '>'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the least amount
possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=' '=\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=" '="'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
[^'"] any character except: ''', '"'
----------------------------------------------------------------------
[^\s>]* any character except: whitespace (\n,
\r, \t, \f, and " "), '>' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
)*? end of grouping
----------------------------------------------------------------------
data-name= 'data-name='
----------------------------------------------------------------------
( group and capture to \3:
----------------------------------------------------------------------
['"]? any character of: ''', '"' (optional
(matching the most amount possible))
----------------------------------------------------------------------
) end of \3
----------------------------------------------------------------------
( group and capture to \4:
----------------------------------------------------------------------
.*? any character (0 or more times
(matching the least amount possible))
----------------------------------------------------------------------
) end of \4
----------------------------------------------------------------------
\3 what was matched by capture \3
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
> '>'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the least amount
possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=' '=\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=" '="'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
[^'"] any character except: ''', '"'
----------------------------------------------------------------------
[^\s>]* any character except: whitespace (\n,
\r, \t, \f, and " "), '>' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
)*? end of grouping
----------------------------------------------------------------------
data-protected= 'data-protected='
----------------------------------------------------------------------
( group and capture to \5:
----------------------------------------------------------------------
['"]? any character of: ''', '"' (optional
(matching the most amount possible))
----------------------------------------------------------------------
) end of \5
----------------------------------------------------------------------
( group and capture to \6:
----------------------------------------------------------------------
.*? any character (0 or more times
(matching the least amount possible))
----------------------------------------------------------------------
) end of \6
----------------------------------------------------------------------
\5 what was matched by capture \5
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
> '>'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?: group, but do not capture (0 or more times
(matching the most amount possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=' '=\''
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the most amount
possible)):
----------------------------------------------------------------------
[^'] any character except: '''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
\\ '\'
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=" '="'
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the most amount
possible)):
----------------------------------------------------------------------
[^"] any character except: '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
\\ '\'
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
[^'"] any character except: ''', '"'
----------------------------------------------------------------------
[^\s>]* any character except: whitespace (\n,
\r, \t, \f, and " "), '>' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
> '>'