I am trying to extract non-HTTP(S) URLs from anchor tags. I need to match the entire anchor tag whenever such a URL is found.
This should match: <a href="example.com/index.html"> bla</a>
This shouldn't match: <a href="https://www.google.com/">bla2 </a>
I have been able to build this regex so far:
(\<a[\s\S]*?)(?<=href)(?:(=[\"\'])|(=))(?!(http[s]?)|(ww[w]?)|(#)|(\/\/))
(?P<url>[\S]*?)(?=([\"\'])|(\s))([\s\S]*?\>)
But this gives me a match even for the one with HTTP.
With this regex: (?<=href=[\"\'])(?!(http[s]?)|(ww[w]?))(?P<url>[\S]+)(?=[\"\'])
I am able to get only the non-HTTP URL, but I need the entire <a> tag to be matched, too.
Any suggestions would be great, and I would be happy if this can be improved further. PS: I cannot use BeautifulSoup, so please suggest a better regex for my problem.
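A minimal reproduction of the behaviour described above, assuming Python's re module (the (?P<url>...) syntax suggests it). The names FIRST_RE, SECOND_RE, plain_tag and http_tag are introduced here for illustration only. The unwanted match on the HTTPS link appears to come from the (=[\"\'])|(=) alternation: when the lookahead rejects http after the quote, the engine backtracks, matches only =, and the lookahead then sees the quote character, which is not http, so it passes and url ends up empty.

```python
import re

# First pattern from the question (it was split over two lines there).
FIRST_RE = re.compile(
    r"""(\<a[\s\S]*?)(?<=href)(?:(=[\"\'])|(=))(?!(http[s]?)|(ww[w]?)|(#)|(\/\/))"""
    r"""(?P<url>[\S]*?)(?=([\"\'])|(\s))([\s\S]*?\>)"""
)

# Second pattern from the question: finds only the URL, not the tag.
SECOND_RE = re.compile(
    r"""(?<=href=[\"\'])(?!(http[s]?)|(ww[w]?))(?P<url>[\S]+)(?=[\"\'])"""
)

plain_tag = '<a href="example.com/index.html"> bla</a>'
http_tag = '<a href="https://www.google.com/">bla2 </a>'

# The first pattern still matches the HTTPS tag, but with an empty url group,
# because the quote was left unconsumed and the lookahead ran against it.
unwanted = FIRST_RE.search(http_tag)
print(repr(unwanted.group("url")))

# The second pattern behaves as described: URL only, and no match for HTTPS.
print(SECOND_RE.search(plain_tag).group("url"))
print(SECOND_RE.search(http_tag))
```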
Answer 0 (score: 0)
This might work:
(<a[^>]*href=[\"\'](?!http|ww)(?:\S+)[\"\'][^>]*>)
This matches <a href="example.com/index.html">. If you need everything up to and including </a>, add something like .*?</\s*a> before the closing parenthesis.
(?!http|ww): a negative lookahead. https? is not actually needed; (?!http) is enough, because http already matches https, and the same goes for ww (www).
(?:\S+): the URL. This could be improved, since many symbols are not allowed in URLs, but it is sufficient for now.
[^>]*: the <a> tag may contain other content.

empt_list = []
empt_list_meaning = []

def game():
    # Pair each stored word with its meaning.
    empt_dict = dict(zip(empt_list, empt_list_meaning))
    a_options = input("Please select one of these options: ")
    if a_options == "1":
        # Option 1: add a new word and its meaning.
        a_newword = input("What word you want to add? ")
        empt_list.append(a_newword)
        a_newword_meaning = input("add the meaning of the word: ")
        empt_list_meaning.append(a_newword_meaning)
    elif a_options == "2":
        # Option 2: list the stored words that contain the search text.
        a_select_word = input("select the words, you want")
        zero = 0
        for word in empt_dict:
            if a_select_word in word:
                zero += 1
                print(zero, word, end=" ")
                print(list(empt_dict).index(word))
    print("would you like to continue or exit?\n1.continue\n2.exit")
    now = input(">>> ")
    if now == "1":
        game()
    else:
        print("bye")

game()
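The regex from the answer can be checked against the two sample tags from the question. This is a minimal sketch assuming Python's re module; ANCHOR_RE and find_non_http_anchors are names made up for this example, and the pattern is the answer's, extended with the suggested .*?</\s*a> so that the whole element is captured.

```python
import re

# Answer's pattern plus .*?</\s*a> before the closing parenthesis.
# DOTALL lets the link text span several lines; IGNORECASE is an added
# assumption so that HTTP:// and <A ...> are treated the same way.
ANCHOR_RE = re.compile(
    r"""(<a[^>]*href=["'](?!http|ww)(?:\S+)["'][^>]*>.*?</\s*a>)""",
    re.IGNORECASE | re.DOTALL,
)

def find_non_http_anchors(html):
    """Return every <a>...</a> element whose href does not start with http or ww."""
    return ANCHOR_RE.findall(html)

sample = (
    '<a href="example.com/index.html"> bla</a>\n'
    '<a href="https://www.google.com/">bla2 </a>'
)

for tag in find_non_http_anchors(sample):
    print(tag)
```

Only the example.com anchor is printed. Because [^>]* cannot cross a closing >, the lookahead is always tested against the href of the tag being examined, so the HTTPS link is skipped rather than matched through backtracking.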