关于:Find Hyperlinks in Text using Python (twitter related)
如何只提取网址,以便将其放入列表/数组?
让我澄清一下,我不想将URL解析成碎片。我想从字符串的文本中提取URL以将其放入数组中。谢谢!
答案 0 :(得分:45)
为了回应OP的编辑,我劫持了Find Hyperlinks in Text using Python (twitter related)并提出了这个问题:
import re
myString = "This is my tweet check it out http://example.com/blah"
print(re.search("(?P<url>https?://[^\s]+)", myString).group("url"))
答案 1 :(得分:25)
误解了问题:
>>> from urllib.parse import urlparse
>>> urlparse('http://www.ggogle.com/test?t')
ParseResult(scheme='http', netloc='www.ggogle.com', path='/test',
params='', query='t', fragment='')
>>> from urlparse import urlparse
>>> urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
params='', query='', fragment='')
ETA :正则表达式确实是最佳选择:
>>> s = 'This is my tweet check it out http://tinyurl.com/blah and http://blabla.com'
>>> re.findall(r'(https?://\S+)', s)
['http://tinyurl.com/blah', 'http://blabla.com']
答案 2 :(得分:8)
这是一个带有巨大正则表达式的文件:
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
the web url matching regex used by markdown
http://daringfireball.net/2010/07/improved_regex_for_matching_urls
https://gist.github.com/gruber/8891611
"""
URL_REGEX = r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))"""
我将该文件称为urlmarker.py
,当我需要它时,我只需导入它,例如
import urlmarker
import re
re.findall(urlmarker.URL_REGEX,'some text news.yahoo.com more text')
比照http://daringfireball.net/2010/07/improved_regex_for_matching_urls和What's the cleanest way to extract URLs from a string using Python?
答案 3 :(得分:3)
关于这个:
import re
myString = "This is my tweet check it out http:// tinyurl.com/blah"
print re.search("(?P<url>https?://[^\s]+)", myString).group("url")
如果字符串中有多个网址,它将无法正常工作。 如果字符串如下:
myString = "This is my tweet check it out http:// tinyurl.com/blah and http:// blabla.com"
您可以这样做:
myString_list = [item for item in myString.split(" ")]
for item in myString_list:
try:
print re.search("(?P<url>https?://[^\s]+)", item).group("url")
except:
pass
答案 4 :(得分:3)
不要忘记检查搜索是否返回None
的值 - 我发现上面的帖子很有帮助,但却浪费了处理None
结果的时间。
请参阅Python Regex "object has no attribute"。
即
import re
myString = "This is my tweet check it out http://tinyurl.com/blah"
match = re.search("(?P<url>https?://[^\s]+)", myString)
if match is not None:
print match.group("url")
答案 5 :(得分:3)
如果您想从任何文本中提取网址,可以使用我的urlextract。它根据文本中的TLD查找URL。它从TLD位置扩展到两侧并获得整个URL。它易于使用。检查一下:https://github.com/lipoja/URLExtract
from urlextract import URLExtract
extractor = URLExtract()
urls = extractor.find_urls("Text with URLs: stackoverflow.com.")
答案 6 :(得分:2)
您可以使用以下怪异的正则表达式:
<android.support.constraint.ConstraintLayout
android:layout_width="match_parent"
android:layout_height="match_parent"
tools:context=".MainActivity">
<android.support.constraint.ConstraintLayout
android:layout_width="200dp"
android:layout_height="200dp"
android:rotation="45"
app:layout_constraintBottom_toBottomOf="parent"
app:layout_constraintEnd_toEndOf="parent"
app:layout_constraintStart_toStartOf="parent"
app:layout_constraintTop_toTopOf="parent">
<ImageView
android:layout_width="100dp"
android:layout_height="100dp"
android:background="@android:color/holo_green_light"
app:layout_constraintStart_toStartOf="parent"
app:layout_constraintTop_toTopOf="parent" />
<ImageView
android:layout_width="100dp"
android:layout_height="100dp"
android:background="@android:color/holo_blue_light"
app:layout_constraintBottom_toBottomOf="parent"
app:layout_constraintStart_toStartOf="parent" />
<ImageView
android:layout_width="100dp"
android:layout_height="100dp"
android:background="@android:color/holo_red_light"
app:layout_constraintEnd_toEndOf="parent"
app:layout_constraintTop_toTopOf="parent" />
<ImageView
android:layout_width="100dp"
android:layout_height="100dp"
android:background="@android:color/holo_orange_light"
app:layout_constraintBottom_toBottomOf="parent"
app:layout_constraintEnd_toEndOf="parent" />
</android.support.constraint.ConstraintLayout>
</android.support.constraint.ConstraintLayout>
此正则表达式将接受以下格式的网址:
<强> INPUT:强>
\b((?:https?://)?(?:(?:www\.)?(?:[\da-z\.-]+)\.(?:[a-z]{2,6})|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|(?:(?:[0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,7}:|(?:[0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,5}(?::[0-9a-fA-F]{1,4}){1,2}|(?:[0-9a-fA-F]{1,4}:){1,4}(?::[0-9a-fA-F]{1,4}){1,3}|(?:[0-9a-fA-F]{1,4}:){1,3}(?::[0-9a-fA-F]{1,4}){1,4}|(?:[0-9a-fA-F]{1,4}:){1,2}(?::[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:(?:(?::[0-9a-fA-F]{1,4}){1,6})|:(?:(?::[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(?::[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(?:ffff(?::0{1,4}){0,1}:){0,1}(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])|(?:[0-9a-fA-F]{1,4}:){1,4}:(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])))(?::[0-9]{1,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])?(?:/[\w\.-]*)*/?)\b
<强>输出:强>
add1 http://mit.edu.com abc
add2 https://facebook.jp.com.2. abc
add3 www.google.be. uvw
add4 https://www.google.be. 123
add5 www.website.gov.us test2
Hey bob on www.test.com.
another test with ipv4 http://192.168.1.1/test.jpg. toto2
website with different port number www.test.com:8080/test.jpg not port 80
www.website.gov.us/login.html
test with ipv4 192.168.1.1/test.jpg.
search at google.co.jp/maps.
test with ipv6 2001:0db8:0000:85a3:0000:0000:ac1f:8001/test.jpg.
<强>说明:强>
http://mit.edu.com
https://facebook.jp.com
www.google.be
https://www.google.be
www.website.gov.us
www.test.com
http://192.168.1.1/test.jpg
www.test.com:8080/test.jpg
www.website.gov.us/login.html
192.168.1.1/test.jpg
google.co.jp/maps
2001:0db8:0000:85a3:0000:0000:ac1f:8001/test.jpg
用于字边界以界定网址和文本的其余部分\b
匹配http://或https //(如果存在)(?:https?://)?
匹配标准网址(可能以(?:(?:www\.)?(?:[\da-z\.-]+)\.(?:[a-z]{2,6})
开头(让我们称之为www.
)STANDARD_URL
以匹配标准Ipv4(让我们称之为(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
)IPv4
(我们称之为(?:(?:[0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,7}:|(?:[0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,5}(?::[0-9a-fA-F]{1,4}){1,2}|(?:[0-9a-fA-F]{1,4}:){1,4}(?::[0-9a-fA-F]{1,4}){1,3}|(?:[0-9a-fA-F]{1,4}:){1,3}(?::[0-9a-fA-F]{1,4}){1,4}|(?:[0-9a-fA-F]{1,4}:){1,2}(?::[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:(?:(?::[0-9a-fA-F]{1,4}){1,6})|:(?:(?::[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(?::[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(?:ffff(?::0{1,4}){0,1}:){0,1}(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])|(?:[0-9a-fA-F]{1,4}:){1,4}:(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9]))
)IPv6
)如果存在:PORT
(?::[0-9]{1,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])
目标对象部分(html文件,jpg,...)(让我们称之为(?:/[\w\.-]*)*/?)
)这提供了以下正则表达式:
RESSOURCE_PATH
<强>来源:强>
IPv6 :Regular expression that matches valid IPv6 addresses
PORT :https://stackoverflow.com/a/12968117/8794221
其他来源: https://code.tutsplus.com/tutorials/8-regular-expressions-you-should-know--net-6149
\b((?:https?://)?(?:STANDARD_URL|IPv4|IPv6)(?:PORT)?(?:RESSOURCE_PATH)\b
<强>输出:强>
$ more url.py
import re
inputString = """add1 http://mit.edu.com abc
add2 https://facebook.jp.com.2. abc
add3 www.google.be. uvw
add4 https://www.google.be. 123
add5 www.website.gov.us test2
Hey bob on www.test.com.
another test with ipv4 http://192.168.1.1/test.jpg. toto2
website with different port number www.test.com:8080/test.jpg not port 80
www.website.gov.us/login.html
test with ipv4 (192.168.1.1/test.jpg).
search at google.co.jp/maps.
test with ipv6 2001:0db8:0000:85a3:0000:0000:ac1f:8001/test.jpg."""
regex=ur"\b((?:https?://)?(?:(?:www\.)?(?:[\da-z\.-]+)\.(?:[a-z]{2,6})|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|(?:(?:[0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,7}:|(?:[0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,5}(?::[0-9a-fA-F]{1,4}){1,2}|(?:[0-9a-fA-F]{1,4}:){1,4}(?::[0-9a-fA-F]{1,4}){1,3}|(?:[0-9a-fA-F]{1,4}:){1,3}(?::[0-9a-fA-F]{1,4}){1,4}|(?:[0-9a-fA-F]{1,4}:){1,2}(?::[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:(?:(?::[0-9a-fA-F]{1,4}){1,6})|:(?:(?::[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(?::[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(?:ffff(?::0{1,4}){0,1}:){0,1}(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])|(?:[0-9a-fA-F]{1,4}:){1,4}:(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])))(?::[0-9]{1,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])?(?:/[\w\.-]*)*/?)\b"
matches = re.findall(regex, inputString)
print(matches)
答案 7 :(得分:2)
只需按照以下代码即可享受....!!!!
<figure>
<img src="img/linus_torvald.jpg" alt="Linus Torvald smiling face" />
<figcaption>Linus Torvald smiling face</figcaption>
</figure>
答案 8 :(得分:1)
[注意:假设你在Twitter数据上使用这个(如有问题所示),最简单的方法是使用他们的API,它将从推文中提取的网址作为字段返回]
答案 9 :(得分:0)
如果从HTML源中提取:
from urlextract import URLExtract
from requests import get
url = "sample.com/samplepage/"
req = requests.get(url)
text = req.text
# or if you already have the html source:
# text2 = "This is html for ex <a href='http://google.com/'>Google</a> <a href='http://yahoo.com/'>Yahoo</a>"
text = text.replace(' ', '').replace('=','')
extractor = URLExtract()
print(extractor.find_urls(text))
输出(text2):
['http://google.com/', 'http://yahoo.com/']