我正在尝试解决使用PHP完成的问题,不确定如何在Python中完成。
在以下三行中,我们希望根据以下两种模式进行匹配:
仅vine.co和twitter.com URL(其他域应忽略)
仅逗号,之前的URL(每行中的最后一个URL应被忽略)
Row 1: https://vine.co/v/5W2Dg3XPX7a,https://vine.co/v/5W2Dg3XPX7a
Row 2: https://twitter.com/dog_rates/status/836677758902222849/photo/1,https://twitter.com/dog_rates/status/836677758902222849/photo/1
Row 3: https://www.gofundme.com/lolas-life-saving-surgery-funds,https://twitter.com/dog_rates/status/835264098648616962/photo/1,https://twitter.com/dog_rates/status/835264098648616962/photo/1
输出将是Python中的数组(此输出基于PHP):
array(3) {
[0]=>
string(30) "https://vine.co/v/5W2Dg3XPX7a
"
[1]=>
string(64) "https://twitter.com/dog_rates/status/836677758902222849/photo/1
"
[2]=>
string(63) "https://twitter.com/dog_rates/status/835264098648616962/photo/1"
}
$input = 'Row 1: https://vine.co/v/5W2Dg3XPX7a,https://vine.co/v/5W2Dg3XPX7a
Row 2: https://twitter.com/dog_rates/status/836677758902222849/photo/1,https://twitter.com/dog_rates/status/836677758902222849/photo/1
Row 3: https://www.gofundme.com/lolas-life-saving-surgery-funds,https://twitter.com/dog_rates/status/835264098648616962/photo/1,https://twitter.com/dog_rates/status/835264098648616962/photo/1';
$array = preg_split('/Row\s\d:\s/s', $input);
$output = array();
foreach ($array as $key => $value) {
if (strlen($value) > 1) {
$URL_arrays = explode(',', $value);
foreach ($URL_arrays as $key => $value) {
if ($key = sizeof($URL_arrays) - 1) {
unset($URL_arrays[sizeof($URL_arrays) - 1]);
} else {
$match = preg_match('/twitter\.com|vine\.co/s', $value);
if ($match) {
array_push($output, $value);
}
}
}
}
}
var_dump($output);
该问题基于this RegEx problem,您可以回答其中任一问题。
答案 0 :(得分:2)
您可以使用此正则表达式捕获网址为vine.com
或twitter.com
的所有URL,这些URL的后面紧跟一个逗号,
https:\/\/(?:www\.)?(?:vine\.co|twitter\.com)[^,\s]*(?=,)
如您所愿,关键是要积极向前看(?=,)
,这可以确保您的URL后面紧跟一个逗号。
使用re.findall
import re
s = '''Row 1: https://vine.co/v/5W2Dg3XPX7a,https://vine.co/v/5W2Dg3XPX7a
Row 2: https://twitter.com/dog_rates/status/836677758902222849/photo/1,https://twitter.com/dog_rates/status/836677758902222849/photo/1
Row 3: https://www.gofundme.com/lolas-life-saving-surgery-funds,https://twitter.com/dog_rates/status/835264098648616962/photo/1,https://twitter.com/dog_rates/status/835264098648616962/photo/1'''
print(re.findall(r'https:\/\/(?:www\.)?(?:vine\.co|twitter\.com)[^,\s]*(?=,)', s))
输出
['https://vine.co/v/5W2Dg3XPX7a', 'https://twitter.com/dog_rates/status/836677758902222849/photo/1', 'https://twitter.com/dog_rates/status/835264098648616962/photo/1']
答案 1 :(得分:1)
因为您不需要保留重复项,所以我建议使用集合而不是数组(但是顺序更改):
{url for x in s.split('\n') for url in x.split(': ')[1].split(',') if 'vine.co' in url or 'twitter.co' in url}
代码:
s = '''Row 1: https://vine.co/v/5W2Dg3XPX7a,https://vine.co/v/5W2Dg3XPX7a
Row 2: https://twitter.com/dog_rates/status/836677758902222849/photo/1,https://twitter.com/dog_rates/status/836677758902222849/photo/1
Row 3: https://www.gofundme.com/lolas-life-saving-surgery-funds,https://twitter.com/dog_rates/status/835264098648616962/photo/1,https://twitter.com/dog_rates/status/835264098648616962/photo/1'''
print({url for x in s.split('\n') for url in x.split(': ')[1].split(',') if 'vine.co' in url or 'twitter.co' in url})
# {'https://twitter.com/dog_rates/status/835264098648616962/photo/1',
# 'https://twitter.com/dog_rates/status/836677758902222849/photo/1',
# 'https://vine.co/v/5W2Dg3XPX7a'}