Question

The image contains the example data set. I want to design a regex that will give me only the ID and title from the data set.

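For example:

15011721827:52352403:War of the League of the Indies

52352403 is the ID of the article, and "War of the League of the Indies" is its title.

I want to extract the ID and title pairs from the given text file.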

Answer 1

([0-9]+)[:]([0-9]+)[:](.*)\n

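• First and second capturing groups ([0-9]+): [0-9] matches a single digit from 0 to 9, and the + quantifier matches it between one and unlimited times, as many times as possible

• [:] matches the character ':' literally

• Third capturing group (.*) matches any character (except line terminators), zero or more times

• \n matches a newline character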

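import re

# Read the whole file; the pattern expects one "number:id:title" record per line.
with open('example.txt') as f:
    text = f.read()

# Group 2 is the article ID, group 3 is the title.
pattern = r'([0-9]+)[:]([0-9]+)[:](.*)\n'
regex = re.compile(pattern)
for match in regex.finditer(text):
    result = "{},{}".format(match.group(2), match.group(3))
    print(result)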
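
Because the pattern ends with \n, a last line that is not followed by a newline would be skipped. A possible variant, as a sketch only: it anchors each line with ^ and $ under re.MULTILINE instead. The sample text is inlined here rather than read from example.txt, and the name extract_pairs is just illustrative, not part of the original answer.

```python
import re

# Same three groups as above, but anchored per line with ^ and $
# (re.MULTILINE), so a final line without a trailing newline still matches.
PATTERN = re.compile(r'^([0-9]+):([0-9]+):(.*)$', re.MULTILINE)

def extract_pairs(text):
    """Return a list of (id, title) tuples found in text."""
    return [(m.group(2), m.group(3)) for m in PATTERN.finditer(text)]

# Inline sample in the question's format; note there is no newline at the end.
sample = (
    "15011721827:52352403:War of the League of the Indies\n"
    "1234567890:12312312:Lorem ipsum dolor sit amet"
)

for article_id, title in extract_pairs(sample):
    print("{},{}".format(article_id, title))
```

Files with Windows line endings (\r\n) would leave a trailing \r in the title with this sketch, so stripping the title may be needed in that case.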

Answer 2

With JavaScript, you can simply use split() to do this, splitting the string wherever a colon is found:

var text = "1234567890:12312312:Lorem ipsum dolor sit amet";
var splitted = text.split(":");

console.log("id : " + splitted[1]);
console.log("Title : " + splitted[2]);

With a pure regular expression, you can use the following: ([0-9]{10,})[:]([0-9]{8})[:]([a-zA-Z ]+)

Group 1 : 1234567890
Group 2 (ID) : 12312312 
Group 3 (Title) : Lorem ipsum dolor sit amet

Group 1 matches 10 or more digits (0-9). Group 2 matches exactly 8 digits. Group 3 matches one or more letters (a-z, A-Z) and spaces.

Working example: https://regex101.com/r/3TudrD/1
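
To get a feel for how strict this pattern is, here is a small check written in Python (used here only for illustration; the pattern itself is unchanged). It runs the pattern against four rows from this data set that are quoted elsewhere in this thread. The helper name strict_match is hypothetical.

```python
import re

# The pattern from this answer, unchanged.
STRICT = re.compile(r'([0-9]{10,})[:]([0-9]{8})[:]([a-zA-Z ]+)')

rows = [
    "15011721827:52352403:War of the League of the Indies",
    "9428491646:27687104:Deepwater Pathfinder",
    "3524782652:4285058:Wikipedia:Articles for deletion/Joseph Prymak",
    "2302538806:1870985:Cardinal Infante Ferdinand",
]

def strict_match(row):
    """Return (id, title) if the row matches from its start, else None."""
    m = STRICT.match(row)
    return (m.group(2), m.group(3)) if m else None

for row in rows:
    print(row, "->", strict_match(row))
```

The first two rows match, but the last two have 7-digit IDs and are rejected by {8}. The title class [a-zA-Z ]+ would also stop at the first character that is not a letter or a space, so the pattern seems to fit only data that follows the exact shape of the example.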

Answer 3

Because a title in the data set can itself contain a :, it is better to use a RegEx like the one below.

15011721827:52352403:War of the League of the Indies
9428491646:27687104:Deepwater Pathfinder
3524782652:4285058:Wikipedia:Articles for deletion/Joseph Prymak
2302538806:1870985:Cardinal Infante Ferdinand

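The third line contains a : that separates Wikipedia from the rest of the title. If you use the split function, you get an array with 4 parts instead of 3. To avoid this kind of problem, I chose to use a regular expression: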

var pattern = /^(\d+):(\d+):(.+)$/
var data = "15011721827:52352403:War of the League of the Indies"
var matches = data.match(pattern)
console.log(matches)

// matches[0] = "15011721827:52352403:War of the League of the Indies"
// matches[1] = "15011721827"
// matches[2] = "52352403"
// matches[3] = "War of the League of the Indies"
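
The same comparison can be sketched in Python, using the row whose title contains a colon. This is an illustration under the assumption that every row has the form number:id:title; the variable names are made up for the example.

```python
import re

row = "3524782652:4285058:Wikipedia:Articles for deletion/Joseph Prymak"

# Splitting on every colon cuts the title into two pieces (4 parts in total).
parts = row.split(":")
print(len(parts), parts)

# Limiting the split to the first two colons keeps the title whole.
prefix, article_id, title = row.split(":", 2)
print(article_id, "|", title)

# The regex from this answer gives the same ID and title,
# because (.+) consumes everything after the second colon.
m = re.match(r'^(\d+):(\d+):(.+)$', row)
print(m.group(2), "|", m.group(3))
```

In Python, split(":", 2) keeps the remainder of the string in the last element. JavaScript's split(":", 3) truncates the result instead, which is one reason the regex is a convenient choice there.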

Designing a regular expression for a specific data set
