Question

我想从字符串中删除网址，并将其替换为原始内容的标题。

例如：

mystring = "Ah I like this site: http://www.stackoverflow.com. Also I must say I like http://www.digg.com"

sanitize(mystring) # it becomes "Ah I like this site: Stack Overflow. Also I must say I like Digg - The Latest News Headlines, Videos and Images"

为了用标题替换url，我写了这个snipplet：

#get_title: string -> string
def get_title(url):
    """Returns the title of the input URL"""

    output = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))
    return output.title.string

我不知何故需要将此函数应用于捕获网址的字符串，并通过get_title转换为标题。

Answer 1

以下是有关验证Python中网址的信息的问题：How do you validate a URL with a regular expression in Python?

urlparse模块可能是你最好的选择。您仍需在应用程序的上下文中确定构成有效URL的内容。

要检查字符串的字符串，您需要迭代字符串中的每个单词检查它，然后用标题替换有效的URL。

示例代码（您需要编写valid_url）：

def sanitize(mystring):
  for word in mystring.split(" "):
    if valid_url(word):
      mystring = mystring.replace(word, get_title(word))
  return mystring

Answer 2

你可以使用正则表达式和替换来解决这个问题（re.sub接受一个函数，它将为每个出现时传递Match对象并返回字符串以替换它）：

url = re.compile("http:\/\/(.*?)/")
text = url.sub(get_title, text)

困难的是创建一个匹配URL的正则表达式，而不是更多，而不是更少。

Python：用字符串中的标题名称替换url

2 个答案: