How do I build a "millions" detector in text data?

Time: 2018-03-16 08:40:13

Tags: python pandas dataframe

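I want to build a detector for numbers between 1 million and 100 million in a text dataset. In my native language the meanings of '.' and ',' are swapped: '.' separates thousands and ',' marks cents.

Here is my data: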

 id    Body
  1    You 're get 4500000
  2    Congrats, you receive 500000
  3    Congrats, you receive 5.000.000
  4    Congrats, you get 2.000.000,00!
  5    Your verification code is 600700800

Here is my expected output:

 id    Body                                   millions
  1    You 're get 4500000                    4500000
  2    Congrats, you receive 500000           0
  3    Congrats, you receive 5.000.000        5000000
  4    Congrats, you get 2.000.000,00!        2000000
  5    Your verification code is 600700800    0

Rows 2 and 5 are zero because their numbers are outside the desired range, i.e. 1000000 - 100000000.

What I did:

df['number'] = df['Body'].str.findall(r'[0-9]').str.len()

Then I filter:

df[(df['number']<9) & (df['number']>6)]

1 answer:

Answer 0 (score: 1)

With a better re pattern, you can do this with Series.str.extract:

from io import StringIO

import pandas as pd

df_str = ''' id    Body
  1    You 're get 4500000
  2    Congrats, you receive 500000
  3    Congrats, you receive 5.000.000
  4    Congrats, you get 2.000.000,00!
  5    Your verification code is 600700800
  6    this line has no numbers
  7    this line has malformed numbers 5.00,8
  '''
df = pd.read_csv(StringIO(df_str), sep=r'\s\s+', engine='python', index_col=0)

# digits, optional '.'-separated groups of 3 digits, optional ',' decimal part
pattern = r'((?:\d+)(?:\.\d{3})*(?:,\d+)?)'
# extract returns the first match in each row (NaN if there is none)
numbers = df['Body'].str.extract(pattern, expand=False)
# drop the thousands separators, turn the decimal comma into a point, convert
number_floats = (numbers.str.replace('.', '', regex=False)
                        .str.replace(',', '.', regex=False)
                        .astype(float))
in_range = (1E6 <= number_floats) & (number_floats <= 1E8)
df['millions'] = number_floats.where(in_range, 0)
Output:

id  Body                                    millions
1   You 're get 4500000                     4500000.0
2   Congrats, you receive 500000            0.0
3   Congrats, you receive 5.000.000         5000000.0
4   Congrats, you get 2.000.000,00!         2000000.0
5   Your verification code is 600700800     0.0
6   this line has no numbers                0.0
7   this line has malformed numbers 5.00,8  0.0

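It only misbehaves when a single row contains more than one number, because only the first match in each row is extracted.

The re pattern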
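If rows with several numbers matter, one possible workaround is to look at every match in the row instead of only the first. The sketch below is not part of the original answer: the helper names to_float and first_in_range are made up for illustration, and returning the first in-range number (rather than, say, the largest) is an assumption about what is wanted.

```python
import re

import pandas as pd

# Same pattern as in the answer, without the outer capturing group
PATTERN = r'\d+(?:\.\d{3})*(?:,\d+)?'


def to_float(token):
    # '.' is the thousands separator and ',' the decimal mark
    return float(token.replace('.', '').replace(',', '.'))


def first_in_range(text, low=1e6, high=1e8):
    """Return the first number in text that lies in [low, high], else 0.0."""
    for token in re.findall(PATTERN, text):
        value = to_float(token)
        if low <= value <= high:
            return value
    return 0.0


# Hypothetical rows; the first one contains two numbers
df = pd.DataFrame({'Body': [
    "Code 600700800, you get 2.000.000,00!",
    "Congrats, you receive 500000",
]})
df['millions'] = df['Body'].apply(first_in_range)
print(df)
```

In the first row, 600700800 is skipped because it is above the range and 2.000.000,00 is picked up, so the order of numbers within the text no longer decides the result.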

(
(?:\d+)         # a number of digits
(?:\.\d{3})*    # a `.` followed by a group of 3 digits; optional, multiple possible
(?:,\d+)?       # a `,` followed by a number of digits; optional
)

(?: means these subgroups are not captured separately
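To see the effect of the non-capturing groups, the commented pattern can be compiled as-is with re.VERBOSE, which makes Python ignore the whitespace and the # comments inside the pattern. This is only a small illustration using the standard re module; the variable names and the comparison pattern with plain capturing groups are added here and are not from the original answer.

```python
import re

# The annotated pattern, compiled in verbose mode
verbose_pattern = re.compile(r"""
    (
    (?:\d+)         # a number of digits
    (?:\.\d{3})*    # a '.' followed by a group of 3 digits; optional, repeatable
    (?:,\d+)?       # a ',' followed by a number of digits; optional
    )
""", re.VERBOSE)

# For comparison: the same pattern with ordinary capturing groups
capturing_pattern = re.compile(r'(\d+)(\.\d{3})*(,\d+)?')

text = "Congrats, you get 2.000.000,00!"

non_capturing_groups = verbose_pattern.search(text).groups()
capturing_groups = capturing_pattern.search(text).groups()

print(non_capturing_groups)  # ('2.000.000,00',)
print(capturing_groups)      # ('2', '.000', ',00')
```

With (?: only the outer group is returned, which is why str.extract yields a single column holding the whole number.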