Extract the text between a number and a word

Time: 2018-09-04 07:02:07

Tags: python regex python-3.x

I have a file with the following content:

01009700  Samsung  Samsung SGH-N625  GSM 1900,GSM 900  
01009800  Motorola  Motorola T194 EOTD  GSM 1900  

01009900  Option International  
,GSM 900  
01009901  Option International  

,GSM 1900,GSM 900 01009902 Option International ,GSM 1900,GSM 900 01009903 Option International ,GSM 1900,GSM 900 01009904 Option International ,GSM 1900,GSM 900 01009905 Option International ,GSM 1900,GSM 900 01009906 Option International ,GSM 1900,GSM 900 01009907 Option International ,GSM 1900,GSM 900 01009908 Option International ,GSM 1900,GSM 900 01009909 Option International ,GSM 1900,GSM 900 01009910 Option International ,GSM 1900,GSM 900 01009911 Option International ,GSM 1900,GSM 900 01009912 Option International ,GSM 1900,GSM 900 01009913 Option International ,GSM 1900,GSM 900 01009914 Option International ,GSM 1900,GSM 900 01009915 Option International ,GSM 1900,GSM 900 01009916 Option International ,GSM 1900,GSM 900 01009917 Option International ,GSM 1900,GSM 900 01009918 Option International ,GSM 1900,GSM 900 01009919 Option International ,GSM 1900,GSM 900 
Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 
01010000  Sierra Wireless Sierra Wireless Aircard 710  GSM 1900  
01010100  Sierra Wireless Sierra Wireless Aircard 750  GSM 1800,GSM 190  
0,GSM 900 

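Using a regular expression, I am trying to extract everything from the 8-digit number up to, but not including, the first occurrence of GSM, for example: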

01009700  Samsung  Samsung SGH-N625
01009800  Motorola  Motorola T194 EOTD
01009900  Option International
01009902  Option International
01009919  Option International
01010000  Sierra Wireless Sierra Wireless Aircard
01010100  Sierra Wireless Sierra Wireless Aircard

I tried \d{8}.+(GSM)?, but it does not seem to work.

What is the correct regex?

1 Answer:

Answer 0: (score: 4)

You can use

re.findall(r'\b(\d{8}.*?)\W*GSM', s)

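See the regex demo.

Details

  • \b - a word boundary
  • (\d{8}.*?) - Group 1: eight digits, then any 0+ characters other than line break characters, as few as possible
  • \W* - any 0+ non-word characters (this covers the spaces, line breaks, and comma that may precede GSM)
  • GSM - a GSM substring.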
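
To see why the attempt from the question fails where this pattern succeeds, the two can be compared on a shortened sample. This is only a sketch: the variable names `sample`, `greedy`, and `lazy` are made up for illustration, and the sample is a small excerpt of the data above.

```python
import re

# A shortened excerpt of the input, including one record that wraps
# onto the next line before ",GSM".
sample = (
    "01009700  Samsung  Samsung SGH-N625  GSM 1900,GSM 900  \n"
    "01009900  Option International  \n"
    ",GSM 900  \n"
)

# The question's pattern: the greedy .+ consumes the rest of the line
# (GSM included), and the optional (GSM)? then matches nothing.
greedy = [m.group(0) for m in re.finditer(r"\d{8}.+(GSM)?", sample)]

# The answer's pattern: the lazy .*? stops as soon as \W*GSM can match,
# and \W* absorbs the spaces, line break, and comma before GSM.
lazy = re.findall(r"\b(\d{8}.*?)\W*GSM", sample)

print(greedy)
print(lazy)
```

The greedy version returns whole lines, with the GSM part still attached to the first one, while the lazy version returns only the text before the first GSM of each record.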

Python demo

import re
s="""01009700  Samsung  Samsung SGH-N625  GSM 1900,GSM 900  
01009800  Motorola  Motorola T194 EOTD  GSM 1900  

01009900  Option International  
,GSM 900  
01009901  Option International  

,GSM 1900,GSM 900 01009902 Option International ,GSM 1900,GSM 900 01009903 Option International ,GSM 1900,GSM 900 01009904 Option International ,GSM 1900,GSM 900 01009905 Option International ,GSM 1900,GSM 900 01009906 Option International ,GSM 1900,GSM 900 01009907 Option International ,GSM 1900,GSM 900 01009908 Option International ,GSM 1900,GSM 900 01009909 Option International ,GSM 1900,GSM 900 01009910 Option International ,GSM 1900,GSM 900 01009911 Option International ,GSM 1900,GSM 900 01009912 Option International ,GSM 1900,GSM 900 01009913 Option International ,GSM 1900,GSM 900 01009914 Option International ,GSM 1900,GSM 900 01009915 Option International ,GSM 1900,GSM 900 01009916 Option International ,GSM 1900,GSM 900 01009917 Option International ,GSM 1900,GSM 900 01009918 Option International ,GSM 1900,GSM 900 01009919 Option International ,GSM 1900,GSM 900 
Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 Option Internati. Globetrotter GSM 1800 
01010000  Sierra Wireless Sierra Wireless Aircard 710  GSM 1900  
01010100  Sierra Wireless Sierra Wireless Aircard 750  GSM 1800,GSM 190  
0,GSM 900 """
print(re.findall(r"\b(\d{8}.*?)\W*GSM", s))

Output:

['01009700  Samsung  Samsung SGH-N625', '01009800  Motorola  Motorola T194 EOTD', '01009900  Option International', '01009901  Option International', '01009902 Option International', '01009903 Option International', '01009904 Option International', '01009905 Option International', '01009906 Option International', '01009907 Option International', '01009908 Option International', '01009909 Option International', '01009910 Option International', '01009911 Option International', '01009912 Option International', '01009913 Option International', '01009914 Option International', '01009915 Option International', '01009916 Option International', '01009917 Option International', '01009918 Option International', '01009919 Option International', '01010000  Sierra Wireless Sierra Wireless Aircard 710', '01010100  Sierra Wireless Sierra Wireless Aircard 750']
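
If the 8-digit code and the description are needed as separate fields, one possible variation is to use two capturing groups and collapse the repeated spaces afterwards. This is an assumption about what might be useful downstream, not part of the original answer; `PATTERN` and `parse_records` are hypothetical names.

```python
import re

# Group 1: the 8-digit code. Group 2: the description, matched lazily
# up to the non-word characters that precede the first GSM.
PATTERN = re.compile(r"\b(\d{8})\s+(.*?)\W*GSM")

def parse_records(text):
    """Return (code, description) pairs with runs of whitespace collapsed."""
    return [(code, " ".join(name.split())) for code, name in PATTERN.findall(text)]

text = (
    "01009700  Samsung  Samsung SGH-N625  GSM 1900,GSM 900  \n"
    "01009902 Option International ,GSM 1900,GSM 900\n"
)
for code, name in parse_records(text):
    print(code, "|", name)
```

With more than one group in the pattern, re.findall returns a list of tuples, one tuple per match, which is what makes the unpacking in the list comprehension work.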