Question

I am using Python 3 to prepare strings containing document titles for use as search terms on the US patent website.

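1) Retaining the longer phrases is beneficial, but

2) Searches do not work well when they include many words of three characters or fewer, so I need to eliminate those.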

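I have tried the regular expression "\b\w[1:3}\b *" to split on one- to three-letter words (with or without a trailing space), but without success. Then again, I am no regex expert.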

for pubtitle in df_tpdownloads['PublicationTitleSplit']:
    pubtitle = pubtitle.lower() # make lower case
    pubtitle = re.split("[?:.,;\"\'\-()]+", pubtitle) # tokenize and remove punctuation
    #print(pubtitle)

    for subArray in pubtitle:
        print(subArray)
        subArray = subArray.strip()
        subArray = re.split("(\b\w{1:3}\b) *", subArray) # split on words that are < 4 letters
        print(subArray)

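The code above steps through a pandas Series and cleans out the punctuation, but it fails to split on word length.

I expect to see something like the following examples.

Examples:

So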

" and training requirements for selected salt applications"

becomes

['training requirements', 'selected salt applications'].

And

"december 31"

becomes

['december'].

And

"experimental system for salt in an emergence research and applications in process heat"

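becomes

['experimental system', 'salt', 'emergence research', 'applications', 'process heat'].

But the split does not capture the small words, and I cannot tell whether the problem is the regular expression, the re.split call, or both.

I could probably take a brute-force approach, but I would like an elegant solution. Any help would be appreciated.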

Answer 1

You can use

list(filter(None, re.split(r'\s*\b\w{1,3}\b\s*|[^\w\s]+', pubtitle.strip().lower())))

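to get the result you want. See the regex demo.

The r'\s*\b\w{1,3}\b\s*|[^\w\s]+' regex splits the lowercased (via .lower()) string, stripped of leading and trailing whitespace (via .strip()), into chunks that contain no punctuation ([^\w\s]+ handles that) and no words of 1 to 3 characters (\s*\b\w{1,3}\b\s* handles that).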
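A note on the filter(None, ...) wrapper, which is not spelled out above: whenever a match sits at the very start or end of the string, or two matches are adjacent, re.split returns an empty string for the gap, and filter(None, ...) drops those. A minimal sketch using one of the titles from the question (the variable names are only illustrative):

```python
import re

pattern = r'\s*\b\w{1,3}\b\s*|[^\w\s]+'
title = " and training requirements for selected salt applications"

# "and " is matched at the very start, so the first piece is an empty string.
raw_pieces = re.split(pattern, title.strip().lower())
print(raw_pieces)

# filter(None, ...) keeps only the truthy (non-empty) pieces.
pieces = list(filter(None, raw_pieces))
print(pieces)
```

A list comprehension such as [p for p in raw_pieces if p] would do the same job if you find that more readable.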

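Pattern details

\s* - 0 or more whitespace characters
\b - a word boundary
\w{1,3} - 1, 2 or 3 word characters (if you do not want to match _, use [^\W_]{1,3} instead)
\b - a word boundary
\s* - 0 or more whitespace characters
| - or
[^\w\s]+ - 1 or more characters other than word and whitespace characters.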
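If you would rather keep that breakdown next to the pattern itself, the same regex can be written with re.VERBOSE so that each part carries its own comment. This is only a sketch: SHORT_WORD_OR_PUNCT and split_title are names made up for the illustration, not anything from the question.

```python
import re

# Same pattern as above; in VERBOSE mode whitespace and "#" comments
# inside the pattern are ignored by the regex engine.
SHORT_WORD_OR_PUNCT = re.compile(r'''
    \s* \b \w{1,3} \b \s*   # a word of 1 to 3 characters plus surrounding whitespace
    |                       # or
    [^\w\s]+                # a run of characters that are neither word nor whitespace
''', re.VERBOSE)

def split_title(title):
    """Split a title on short words and punctuation, dropping empty pieces."""
    return [piece for piece in SHORT_WORD_OR_PUNCT.split(title.strip().lower()) if piece]

print(split_title("december 31"))
print(split_title("experimental system for salt in an emergence research and applications in process heat"))
```

Compiling the pattern once also avoids re-parsing it for every title in the series.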

See the Python demo:

import re

df_tpdownloads = [" and training requirements for selected salt applications",
                  "december 31",
                  "experimental system for salt in an emergence research and applications in process heat"]

#for pubtitle in df_tpdownloads['PublicationTitleSplit']:
for pubtitle in df_tpdownloads:
    result = list(filter(None, re.split(r'\s*\b\w{1,3}\b\s*|[^\w\s]+', pubtitle.strip().lower())))
    print(result)

Output:

['training requirements', 'selected salt applications']
['december']
['experimental system', 'salt', 'emergence research', 'applications', 'process heat']
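As for why the attempt in the question did not split anything, my reading (not something verified against your actual data) is that three things were working against it: the pattern was not a raw string, so "\b" became a backspace character; {1:3} is not a valid quantifier, so the regex engine takes it as literal text; and the capturing group would have put the short words back into the result anyway. A small sketch of each point:

```python
import re

# 1) In a normal string literal "\b" is the backspace character;
#    only the raw string r"\b" reaches the regex engine as a word boundary.
print("\b" == "\x08")

text = "training requirements for selected salt applications"

# 2) {1:3} is not a quantifier, so it is treated as the literal text "{1:3}"
#    and the pattern never matches: the string comes back unsplit.
unsplit = re.split(r"(\b\w{1:3}\b) *", text)
print(unsplit)

# 3) With the quantifier fixed, the capturing group still makes re.split
#    return the matched short words as list items.
with_group = re.split(r"(\b\w{1,3}\b) *", text)
print(with_group)
```

Dropping the parentheses (or using a non-capturing group) and consuming the whitespace on both sides of the short word, as in the pattern above, avoids the last problem.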
