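Convert plain-text tweets to include hyperlinks

Date: 2019-06-26 08:53:28

Tags: regex perl

I have exported my data from Twitter so that I can add my own tweets to my personal blog. Over the past ten years I posted every tweet as plain text. Here is an example: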

When a new startup enters your industry and innovates around you, winning your customers and taking your revenues, if you fail to transform your own business in response, are you negligent? Do shareholders have a claim against you? https://myurl.com/blah #Governance #liability #corporatenegligence

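I want to process each tweet and add HTML anchor tags around every URL and hashtag found in the text. So, basically, anything starting with http/https becomes a link, and anything starting with a hash becomes a link.

I am struggling to come up with a regex to do this. The anchor tag for a URL simply uses the URL itself as the href. The href for a hashtag is https://twitter.com/hashtag/TAG, where TAG is the hashtag text after the # and before the first non-alphanumeric character.

Each tweet is stored in an array of scalars called @tweets, so iterating over them is simple.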

2 answers:

Answer 0 (score: 3)

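This sounds like a job for a couple of basic regexes.

A link is "http://" or "https://" followed by a run of non-whitespace characters: https?://\S+

A hashtag is a # followed by a run of alphanumeric characters: #\w+ (note that \w also matches the underscore).

So the code might look something like this: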

#!/usr/bin/perl

use strict;
use warnings;
use feature 'say';

$_ = join '', <DATA>;

# Before
say;

# Convert links
s|(https?://\S+)|<a href="$1">$1</a>|g;

# Convert hashtags
s|#(\w+)|<a href="https://twitter.com/hashtag/$1">#$1</a>|g;

# After
say;

__DATA__
When a new startup enters your industry and innovates around you, winning
your customers and taking your revenues, if you fail to transform your own
business in response, are you negligent? Do shareholders have a claim against
you? https://myurl.com/blah #Governance #liability #corporatenegligence
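One caveat with running the two substitutions one after the other: if a URL contains a fragment (for example https://example.com/page#section), the second pass also matches the `#section` that now sits inside the href. A minimal sketch of one way around this, assuming the tweets live in `@tweets` as described in the question, is to handle both patterns in a single pass so that text already consumed as a URL is never rescanned. The helper name `linkify` and the second sample tweet are illustrative and not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: linkify one tweet in a single pass.
# The URL alternative is tried first, so a "#fragment" inside a URL
# is consumed as part of the URL and never treated as a hashtag.
sub linkify {
    my ($tweet) = @_;
    $tweet =~ s{(https?://\S+)|#(\w+)}{
        defined $1
            ? qq{<a href="$1">$1</a>}
            : qq{<a href="https://twitter.com/hashtag/$2">#$2</a>}
    }ge;
    return $tweet;
}

# Sample data standing in for the asker's @tweets array.
my @tweets = (
    'Do shareholders have a claim against you? https://myurl.com/blah #Governance #liability',
    'See https://example.com/page#section for details',
);

print linkify($_), "\n" for @tweets;
```

This still does no HTML escaping and still treats trailing punctuation as part of a URL, so it is only suitable for input you trust to be simple.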

Answer 1 (score: 1)

Try URL::Search. It handles many edge cases, such as URLs that are followed or surrounded by punctuation:

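use strict;
use warnings;
use URL::Search '$URL_SEARCH_RE';

# Sample input; in practice $text would be one tweet from @tweets.
my $text = 'Do shareholders have a claim against you? https://myurl.com/blah #Governance';

# Wrap every URL that URL::Search recognises in an anchor tag.
$text =~ s{($URL_SEARCH_RE)}{<a href="$1">$1</a>}g;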
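To illustrate the kind of edge case meant here, the following sketch uses only core Perl to show how a naive `\S+` pattern swallows punctuation that follows a URL. The sample sentence and the trimming step are assumptions made for illustration; the trimming is a crude stand-in and not how URL::Search works internally.

```perl
use strict;
use warnings;

my $text = 'Read this (https://myurl.com/blah), then https://myurl.com/more.';

# The naive pattern treats the trailing ")," and "." as part of the URLs.
my @naive = $text =~ m{(https?://\S+)}g;

# Hypothetical mitigation: strip trailing punctuation after matching.
# This also damages URLs that legitimately end in ")" or similar.
my @trimmed = map { s/[.,;:!?)\]'"]+\z//r } @naive;

print "naive:   $_\n" for @naive;
print "trimmed: $_\n" for @trimmed;
```

The non-destructive `/r` substitution flag needs Perl 5.14 or later. A dedicated module is the more robust choice because simple trimming cannot tell punctuation that belongs to the URL from punctuation that belongs to the sentence.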

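But there is another problem. If the result is going to be used as HTML, both the URLs and the surrounding text must be HTML-escaped, yet you certainly don't want to HTML-escape the valid HTML you have just generated. To solve this, you can split the string into URL and non-URL parts, escape both, wrap the URLs, and then join everything back together. Fortunately, URL::Search has a partition_urls function designed for exactly this.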

use strict;
use warnings;
use utf8;
use URL::Search 'partition_urls';
use HTML::Entities;

my $text = do { local $/; <DATA> };

my $output = '';
foreach my $section (partition_urls $text) {
  my $escaped = encode_entities $section->[1];
  if ($section->[0] eq 'URL') {
    $output .= qq{<a href="$escaped">$escaped</a>};
  } else {
    $escaped =~ s{(?<!\S)#([a-zA-Z0-9]+)}{<a href="https://twitter.com/hashtag/$1">#$1</a>}g;
    $output .= $escaped;
  }
}

print $output;

__DATA__
When a new startup enters your industry and innovates around you, winning
your customers and taking your revenues, if you fail to transform your own
business in response, are you negligent? Do shareholders have a claim against
you? https://myurl.com/blah #Governance #liability #corporatenegligence

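Another complication is that the hashtag must be URI-escaped before it can be used in the URL, and that has to happen before HTML escaping. Restricting the characters allowed in a hashtag to ASCII letters and digits ([a-zA-Z0-9]) avoids this problem. The alternative is to split the non-URL parts again into hashtag and non-hashtag text so that each can be treated separately.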
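The alternative described above could be sketched roughly as follows for a single non-URL section. To keep the example free of non-core dependencies, it uses two small hand-written helpers, `escape_html` and `escape_uri`; both names are hypothetical, and in real code HTML::Entities and URI::Escape would be the more natural choice. The hashtag pattern (Unicode letters, digits and underscore) is also an assumption, not Twitter's exact rule.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: minimal HTML escaping for text and attribute values.
sub escape_html {
    my ($s) = @_;
    my %ent = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;',
               '"' => '&quot;', "'" => '&#39;');
    $s =~ s/([&<>"'])/$ent{$1}/g;
    return $s;
}

# Hypothetical helper: percent-encode the UTF-8 bytes of a string,
# leaving only RFC 3986 unreserved characters untouched.
sub escape_uri {
    my ($s) = @_;
    my $bytes = encode('UTF-8', $s);
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $bytes;
}

# Split a non-URL section into hashtag and non-hashtag pieces and
# escape each piece according to where it ends up.
sub link_hashtags {
    my ($text) = @_;
    # The capturing group makes split return the hashtags too,
    # always at the odd indices of the list.
    my @parts = split /(?<!\S)(#[\p{L}\p{N}_]+)/, $text;
    my $out = '';
    for my $i (0 .. $#parts) {
        if ($i % 2) {
            my $tag  = substr $parts[$i], 1;   # drop the leading '#'
            # URI-escape first, then HTML-escape the finished href.
            my $href = escape_html('https://twitter.com/hashtag/' . escape_uri($tag));
            $out .= qq{<a href="$href">} . escape_html("#$tag") . '</a>';
        } else {
            $out .= escape_html($parts[$i]);
        }
    }
    return $out;
}

binmode STDOUT, ':encoding(UTF-8)';
print link_hashtags("R&D #Governance #caf\x{e9}"), "\n";
```

The input is expected to be a decoded character string, so tweets read from a file would need to be decoded from UTF-8 first.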