Suppose I have the following text:
'\nModels: Introduction to models | Field types | Indexes | Meta options | Model class\nQuerySets: Making queries | QuerySet method reference | Lookup expressions\nModel instances: Instance methods | Accessing related objects\nMigrations: Introduction to Migrations | Operations reference | SchemaEditor | Writing migrations\nAdvanced: Managers | Raw SQL | Transactions | Aggregation | Search | Custom fields | Multiple databases | Custom lookups | Query Expressions | Conditional Expressions | Database Functions\nOther: Supported databases | Legacy databases | Providing initial data | Optimize database access | PostgreSQL specific features\n'
The result I want is:
['Models: Introduction to models | Field types | Indexes | Meta options | Model class',
'QuerySets: Making queries | QuerySet method reference | Lookup expressions',
'Model instances: Instance methods | Accessing related objects',
'Migrations: Introduction to Migrations | Operations reference | SchemaEditor | Writing migrations',
'Advanced: Managers | Raw SQL | Transactions | Aggregation | Search | Custom fields | Multiple databases | Custom lookups | Query Expressions | Conditional Expressions | Database Functions',
'Other: Supported databases | Legacy databases | Providing initial data | Optimize database access | PostgreSQL specific features',]
My first attempt was:
In [61]: re.split('\n', content)
Out[61]:
['',
'Models: Introduction to models | Field types | Indexes | Meta options | Model class',
'QuerySets: Making queries | QuerySet method reference | Lookup expressions',
'Model instances: Instance methods | Accessing related objects',
'Migrations: Introduction to Migrations | Operations reference | SchemaEditor | Writing migrations',
'Advanced: Managers | Raw SQL | Transactions | Aggregation | Search | Custom fields | Multiple databases | Custom lookups | Query Expressions | Conditional Expressions | Database Functions',
'Other: Supported databases | Legacy databases | Providing initial data | Optimize database access | PostgreSQL specific features',
'']
However, when I tried:
In [60]: re.split('\n.+',content)
Out[60]: ['', '', '', '', '', '', '\n']
The output is not what I expected, and I can't understand it.
In the example from 6.2. re - Regular expression operations:
re.split(r'\W+', 'Words, words, words.')
outputs ['Words', 'words', 'words', '']
not [',', ',', ',', ' ']
#why
re.split('\n.+',content)
outputs ['', '', '', '', '', '', '\n']
Answer 0 (score: 0)
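The reason is this: \n.+ matches a newline followed by one or more characters of any kind, except that . does not match a newline (unless re.DOTALL is set). So each match starts at a \n and ends one character before the next \n. Your split delimiters are therefore the lines themselves, that is, all the visible text. Splitting on them leaves only the empty strings that sit between consecutive delimiters, plus the final \n that no match consumed.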
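To see this concretely, here is a minimal sketch on a shortened stand-in string (sample is a hypothetical name and value, not the asker's content). re.findall shows what the pattern treats as a delimiter, and re.split returns only what is left between those delimiters.

```python
import re

# Shortened stand-in for the original text (hypothetical sample data).
sample = '\na a a\nb b b\nc c c\n'

# What '\n.+' actually matches: each newline plus the whole line after it.
delimiters = re.findall('\n.+', sample)
print(delimiters)   # ['\na a a', '\nb b b', '\nc c c']

# split() returns the text between those matches, which is empty,
# plus the final newline that no match consumed.
pieces = re.split('\n.+', sample)
print(pieces)       # ['', '', '', '\n']
```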
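Your input string begins and ends with a newline, so after splitting on \n the first and last items are empty strings. Modify the pattern to
(?<!^)\n(?!$)
so that it does not match the leading and trailing newlines. This uses a negative lookbehind and a negative lookahead to exclude the first and last ones. Note that those two newlines then remain part of the first and last items.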
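A quick check of a lookbehind/lookahead pattern of this kind, written here as (?<!^)\n(?!$), on a shortened stand-in string (hypothetical data) shows what happens to the outer newlines. Stripping the input first, as sketched in the second half, is one assumed way to get the list the question asks for.

```python
import re

sample = '\na a a\nb b b\nc c c\n'  # hypothetical stand-in text

# Split only on newlines that are neither at the start nor at the end.
parts = re.split(r'(?<!^)\n(?!$)', sample)
print(parts)    # ['\na a a', 'b b b', 'c c c\n']

# The outer newlines survive inside the first and last items,
# so one option is to strip them from the input before splitting.
clean = re.split(r'\n', sample.strip('\n'))
print(clean)    # ['a a a', 'b b b', 'c c c']
```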
Answer 1 (score: 0)
Given:
>>> txt
'\nModels: Introduction to models | Field types | Indexes | Meta options | Model class\nQuerySets: Making queries | QuerySet method reference | Lookup expressions\nModel instances: Instance methods | Accessing related objects\nMigrations: Introduction to Migrations | Operations reference | SchemaEditor | Writing migrations\nAdvanced: Managers | Raw SQL | Transactions | Aggregation | Search | Custom fields | Multiple databases | Custom lookups | Query Expressions | Conditional Expressions | Database Functions\nOther: Supported databases | Legacy databases | Providing initial data | Optimize database access | PostgreSQL specific features\n'
You can do this (I have formatted the output to match your desired example...):
>>> [e for e in txt.split('\n') if e]
['Models: Introduction to models | Field types | Indexes | Meta options | Model class',
'QuerySets: Making queries | QuerySet method reference | Lookup expressions',
'Model instances: Instance methods | Accessing related objects',
'Migrations: Introduction to Migrations | Operations reference | SchemaEditor | Writing migrations',
'Advanced: Managers | Raw SQL | Transactions | Aggregation | Search | Custom fields | Multiple databases | Custom lookups | Query Expressions | Conditional Expressions | Database Functions',
'Other: Supported databases | Legacy databases | Providing initial data | Optimize database access | PostgreSQL specific features']
The same approach works with re.split:
>>> [e for e in re.split('\n+',txt) if e]
# same output...
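Alternatively, you can capture what you want instead of splitting on what you don't want. In this case, use a lookbehind to find the text that follows a \n: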
>>> re.findall(r'(?<=\n)([^\n]+)', txt)
# same output
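As a sanity check, the three approaches can be compared on a shortened stand-in string (hypothetical data, not the original txt). The sketch also illustrates one caveat of the lookbehind version: it relies on the text starting with a newline, whereas a plain [^\n]+ does not.

```python
import re

txt = '\na a a\nb b b\nc c c\n'  # hypothetical stand-in text

by_str_split = [e for e in txt.split('\n') if e]
by_re_split = [e for e in re.split('\n+', txt) if e]
by_findall = re.findall(r'(?<=\n)([^\n]+)', txt)
print(by_str_split == by_re_split == by_findall)   # True

# Without a leading newline the lookbehind skips the first line.
no_lead = 'a a a\nb b b'
missed = re.findall(r'(?<=\n)([^\n]+)', no_lead)
print(missed)   # ['b b b']

# Matching runs of non-newline characters works in both cases.
plain = re.findall(r'[^\n]+', no_lead)
print(plain)    # ['a a a', 'b b b']
```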
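Answer 2 (score: 0)
As Garbage Collector mentioned, the problem with your first attempt is that your original text is surrounded by \n, so when you split it you get an empty string as the first element to reflect that, and likewise at the end.
To fix this, first remove those newlines, which is easily done with the .strip method: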
>>> import re
>>> t = "\na a a\nb b b\nc c c\n"
>>> t.strip()
'a a a\nb b b\nc c c'
>>> re.split("\n",t.strip())
['a a a', 'b b b', 'c c c']
>>>
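For the task described here you don't need the re module. The str class comes with plenty of methods for handling all sorts of common cases, and the splitlines method does the job too: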
>>> t.strip().splitlines()
['a a a', 'b b b', 'c c c']
>>>
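Two details may matter on less tidy input. The sketch below uses a hypothetical string with a Windows-style line ending and a blank line in the middle: splitlines treats \r\n as a single line break, and strip only removes whitespace at the two ends, so interior blank lines still need filtering.

```python
t = "\na a a\r\nb b b\n\nc c c\n"  # hypothetical text with \r\n and a blank line

# strip() only trims the ends, so the interior blank line remains.
lines = t.strip().splitlines()
print(lines)       # ['a a a', 'b b b', '', 'c c c']

# Filtering out empty strings handles the ends and the middle alike.
non_empty = [line for line in t.splitlines() if line]
print(non_empty)   # ['a a a', 'b b b', 'c c c']
```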
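Answer 3 (score: 0)
I think you may be confused by the regular expression example in the documentation. In a regular expression, \W+ matches one or more non-word characters, so spaces, punctuation and the like are what get matched and used as delimiters. That is why re.split returns the list of words.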
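The documentation example can be taken apart to show the same point: re.split returns the text between matches, not the matches themselves. Wrapping the pattern in a capturing group makes re.split include the delimiters as well, which is a convenient way to inspect them.

```python
import re

text = 'Words, words, words.'

# The delimiters that \W+ matches.
seps = re.findall(r'\W+', text)
print(seps)        # [', ', ', ', '.']

# split() returns what lies between the delimiters; the trailing ''
# is the empty text after the final '.'.
words = re.split(r'\W+', text)
print(words)       # ['Words', 'words', 'words', '']

# With a capturing group the delimiters are kept in the result.
both = re.split(r'(\W+)', text)
print(both)        # ['Words', ', ', 'words', ', ', 'words', '.', '']
```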
To make your example work, you just need to remove the .+ from your pattern.
For example:
import re
content = '\nModels: Introduction to models | Field types | Indexes | Meta options | Model class\nQuerySets: Making queries | QuerySet method reference | Lookup expressions\nModel instances: Instance methods | Accessing related objects\nMigrations: Introduction to Migrations | Operations reference | SchemaEditor | Writing migrations\nAdvanced: Managers | Raw SQL | Transactions | Aggregation | Search | Custom fields | Multiple databases | Custom lookups | Query Expressions | Conditional Expressions | Database Functions\nOther: Supported databases | Legacy databases | Providing initial data | Optimize database access | PostgreSQL specific features\n'
re.split(r'\n',content)
['', 'Models: Introduction to models | Field types | Indexes | Meta options | Model class', 'QuerySets: Making queries | QuerySet method reference | Lookup expressions', 'Model instances: Instance methods | Accessing related objects', 'Migrations: Introduction to Migrations | Operations reference | SchemaEditor | Writing migrations', 'Advanced: Managers | Raw SQL | Transactions | Aggregation | Search | Custom fields | Multiple databases | Custom lookups | Query Expressions | Conditional Expressions | Database Functions', 'Other: Supported databases | Legacy databases | Providing initial data | Optimize database access | PostgreSQL specific features', '']
Also, if you know your text always starts and ends with \n, you can slice off the empty strings:
re.split(r'\n',content)[1:-1]
['Models: Introduction to models | Field types | Indexes | Meta options | Model class', 'QuerySets: Making queries | QuerySet method reference | Lookup expressions', 'Model instances: Instance methods | Accessing related objects', 'Migrations: Introduction to Migrations | Operations reference | SchemaEditor | Writing migrations', 'Advanced: Managers | Raw SQL | Transactions | Aggregation | Search | Custom fields | Multiple databases | Custom lookups | Query Expressions | Conditional Expressions | Database Functions', 'Other: Supported databases | Legacy databases | Providing initial data | Optimize database access | PostgreSQL specific features']