Question

I have very long strings, for example

"123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"

and

"321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes"

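I want to split them based on the pattern "a number, a space, a dash, a space, then some text up to the next number, space, dash, space, or the end of the string". Note that the text may contain commas, ampersands, ">" and other special characters, so splitting on those characters will not work. I believe Python can split on a regular expression, but I am having trouble forming one.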

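I am not very well versed in regular expressions. I can write a regex for a number and for an alphanumeric string, but I do not know how to specify "everything up to the point where the next number begins".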

Update: expected output:

First case:

["123 - Footwear", "5678 - Apparel, Accessories & Luxury Goods", "9876 - Leisure Products"]

Second case:

["321 - Apparel & Accessories", "4321 - Apparel & Accessories > Handbags, Wallets & Cases", "187 - Apparel & Accessories > Shoes"]

Answer 1

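Here is the pattern: first there is a number, so we use [0-9]+, followed by text and special characters such as &, - and >, so we can use the following character class: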

[a-zA-Z \-&>]+

For the string from the OP, which also contains commas, add a comma to the character class:

>>> import re
>>> str_ = "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"
>>> re.findall(r'(?is)([0-9]+[a-zA-Z \-&>,]+)', str_)
['123 - Footwear, ',
 '5678 - Apparel, Accessories & Luxury Goods, ',
 '9876 - Leisure Products']
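
Each match above still carries the trailing ", " separator. A minimal sketch of one way to trim it, assuming the separator is always a comma followed by spaces (the names matches and parts are only illustrative):

```python
import re

str_ = "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"

# Same character-class pattern as above, without the (?is) flags,
# which have no effect here (no dot, and both letter cases are listed).
matches = re.findall(r"[0-9]+[a-zA-Z \-&>,]+", str_)

# rstrip(", ") removes any run of trailing commas and spaces from each match.
parts = [m.rstrip(", ") for m in matches]
print(parts)
```

Note that the character class only covers the characters listed in it, so a name containing anything else (a slash, for instance) would end the match early.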

Answer 2

If digits appear only at the start of each segment of the string, you can do the following:

import re
for s in "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products", "321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes":
    print(re.findall(r'\d+\D+(?=,\s*\d|$)', s))

This outputs:

['123 - Footwear', '5678 - Apparel, Accessories & Luxury Goods', '9876 - Leisure Products']
['321 - Apparel & Accessories', '4321 - Apparel & Accessories > Handbags, Wallets & Cases', '187 - Apparel & Accessories > Shoes']

This regex first uses \d+ to match the digits, then \D+ to match the non-digits, and a lookahead (?=,\s*\d|$) to make sure the non-digits stop at a point that is followed by a comma, optional whitespace and another digit, or by the end of the string. As a result, the matches do not include the trailing comma and space.
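
The "digits only at the start of each segment" condition matters. A small sketch with a made-up input (the "Top 10 Brands" name is hypothetical, not from the question) compares this pattern with the stricter one from Answer 3, which requires the "number, space, dash, space" prefix:

```python
import re

# Hypothetical input: the second name itself contains a number.
s = "123 - Footwear, 456 - Top 10 Brands, 789 - Shoes"

# Pattern from this answer: digits followed by non-digits.
simple = re.findall(r"\d+\D+(?=,\s*\d|$)", s)

# Pattern from Answer 3: a new item starts only at "number - ".
strict = re.findall(r"\d+\s+-\s+.*?(?=\s*(?:,\s*)?\d+\s+-\s|\Z)", s, re.S)

print(simple)  # the "456 - Top 10 Brands" segment is not recovered intact
print(strict)
```

If names can contain digits, the stricter pattern is the safer choice; otherwise the shorter one is easier to read.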

Answer 3

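You can match substrings that start with one or more digits followed by 1+ whitespace characters, a -, and 1+ whitespace characters, and that run up to the next occurrence of the same pattern or the end of the string.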

re.findall(r"\d+\s+-\s+.*?(?=\s*(?:,\s*)?\d+\s+-\s|\Z)", s, re.S)

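See the regex demo.

Note: if the leading number is always longer than one digit, e.g. at least two digits, you can replace \d+ with \d{2,}, and so on. Adjust as needed.

Regex details

\d+ - 1 or more digits
\s+-\s+ - a - enclosed in 1+ whitespace characters
.*? - any 0 or more characters, as few as possible, up to the position in the string that is immediately followed by...
(?=\s*(?:,\s*)?\d+\s+-\s|\Z) - (a positive lookahead):
- \s*(?:,\s*)?\d+\s+-\s - 0+ whitespace characters, an optional substring made of a comma and 0+ whitespace characters, 1+ digits, 1+ whitespace characters, a - and one whitespace character
- | - or
- \Z - end of string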
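
The breakdown above can also be written directly into the pattern. This is a sketch using re.VERBOSE, which ignores whitespace in the pattern and allows # comments; the sample string is the first one from the question, and the name rx is only illustrative:

```python
import re

rx = re.compile(r"""
    \d+                        # 1 or more digits
    \s+-\s+                    # a hyphen enclosed in whitespace
    .*?                        # as few characters as possible, up to...
    (?=                        # ...a position followed by either
        \s*(?:,\s*)?\d+\s+-\s  #   an optional comma, then "number - "
      |                        #   or
        \Z                     #   the end of the string
    )
""", re.VERBOSE | re.DOTALL)

s = "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"
result = rx.findall(s)
print(result)
```

re.DOTALL is the long name for re.S, so the dot also matches newlines here.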

Python demo:

import re

rx = r"\d+\s+-\s+.*?(?=\s*(?:,\s*)?\d+\s+-\s|\Z)"
texts = ["123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products", "321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes"]
for s in texts:
    print("--- {} ---".format(s))
    print(re.findall(rx, s, re.S))

Output:

--- 123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products ---
['123 - Footwear', '5678 - Apparel, Accessories & Luxury Goods', '9876 - Leisure Products']
--- 321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes ---
['321 - Apparel & Accessories', '4321 - Apparel & Accessories > Handbags, Wallets & Cases', '187 - Apparel & Accessories > Shoes']
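
If the number and the name are needed separately, the same pattern can take two capture groups. This is only a sketch of one possible follow-up step; the dictionary name categories is made up for illustration and is not part of the original answer:

```python
import re

# Same pattern as above, with capture groups around the number and the name.
rx = re.compile(r"(\d+)\s+-\s+(.*?)(?=\s*(?:,\s*)?\d+\s+-\s|\Z)", re.S)

s = "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"

# With two groups, findall returns a list of (number, name) tuples.
categories = {int(num): name for num, name in rx.findall(s)}
print(categories)
```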

Answer 4

Isn't it as simple as splitting whenever a number is encountered?

import re

s = "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"
re.findall(r'\d+\D+', s)

['123 - Footwear, ',
 '5678 - Apparel, Accessories & Luxury Goods, ',
 '9876 - Leisure Products']
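
Since the question asks about splitting, here is a sketch of an alternative that uses re.split instead of re.findall. It assumes items are separated by a comma and that every item begins with "number, whitespace, dash, whitespace"; the names splitter and results are only illustrative:

```python
import re

texts = [
    "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products",
    "321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes",
]

# Split on a comma plus optional whitespace, but only where a "number - "
# prefix follows. The lookahead is zero-width, so the number is not consumed
# and stays with its own segment.
splitter = re.compile(r",\s*(?=\d+\s+-\s)")

results = [splitter.split(s) for s in texts]
for parts in results:
    print(parts)
```

Unlike the plain \d+\D+ pattern, this leaves no trailing comma or space on the pieces, and commas inside a name are left alone because they are not followed by a number and a dash.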

Split a string based on a pattern in Python