Python regex to match an <a> tag in HTML content

Date: 2018-06-01 10:33:49

Tags: python html regex url

I am trying to fetch non-http(s) URLs from anchor tags. If such a URL is found, I need to match the entire anchor tag.

Example:

This should match: <a href="example.com/index.html"> bla</a>

This shouldn't match: <a href="https://www.google.com/">bla2 </a>

I have been able to build this regex so far:

(\<a[\s\S]*?)(?<=href)(?:(=[\"\'])|(=))(?!(http[s]?)|(ww[w]?)|(#)|(\/\/))(?P<url>[\S]*?)(?=([\"\'])|(\s))([\s\S]*?\>)

But this gives me a match even for the one with HTTP.

With this regex: (?<=href=[\"\'])(?!(http[s]?)|(ww[w]?))(?P<url>[\S]+)(?=[\"\']) I am able to get only the non-http URL, but I also need the entire content of the <a> tag to be matched.

Any suggestions would be great, and I would be happy if this can be further improved. PS: I cannot use BeautifulSoup, so please suggest a better regex for my problem.

1 answer:

Answer 0 (score: 0)

This might work:

(<a[^>]*href=[\"\'](?!http|ww)(?:\S+)[\"\'][^>]*>)

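If you also need everything up to </a>, so that for example <a href="example.com/index.html"> bla</a> is matched as a whole, add .*?</\s*a> before the closing parenthesis.

Explanation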

  • (?!http|ww): a negative lookahead. https? is not actually needed; (?!http) is enough, because http already matches https. The same goes for ww and www.
  • (?:\S+): the URL. This can be improved, since many symbols are not allowed in URLs, but it is sufficient for now.
  • [^>]*: the <a> tag may contain other content, such as further attributes.
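
As a rough sketch of how the pattern from this answer could be used from Python, the snippet below runs both the plain pattern and the variant extended with .*?</\s*a> against the two sample tags from the question. The names OPEN_TAG, FULL_TAG and html are made up for illustration and are not part of the original answer.

```python
import re

# Pattern from the answer: matches only the opening <a ...> tag
# whose href value does not start with "http" or "ww".
OPEN_TAG = re.compile(r"""(<a[^>]*href=["'](?!http|ww)(?:\S+)["'][^>]*>)""")

# Same pattern with .*?</\s*a> added before the closing parenthesis,
# so the link text and the closing </a> are part of the match as well.
FULL_TAG = re.compile(r"""(<a[^>]*href=["'](?!http|ww)(?:\S+)["'][^>]*>.*?</\s*a>)""")

# The two sample tags from the question (illustrative input).
html = (
    '<a href="example.com/index.html"> bla</a>\n'
    '<a href="https://www.google.com/">bla2 </a>'
)

open_tags = OPEN_TAG.findall(html)
full_tags = FULL_TAG.findall(html)

print(open_tags)
print(full_tags)
```

Note that . does not match a newline by default, so if the link text can span several lines, passing re.DOTALL to re.compile would probably be needed for the second pattern.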