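I want to build a detector for numbers from 1 million to 100 million in a text dataset, because in my native language '.' and ',' have swapped meanings ('.' separates thousands and ',' separates cents).
Here is my data: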
id Body
1 You 're get 4500000
2 Congrats, you receive 500000
3 Congrats, you receive 5.000.000
4 Congrats, you get 2.000.000,00!
5 Your verification code is 600700800
This is my expected output:
id Body millions
1 You 're get 4500000 4500000
2 Congrats, you receive 500000 0
3 Congrats, you receive 5.000.000 5000000
4 Congrats, you get 2.000.000,00! 2000000
5 Your verification code is 600700800 0
Rows 2 and 5 are zero because their numbers fall outside the required range, i.e. 1000000 - 100000000.
What I did:
df['number'] = df['Body'].str.findall(r'[0-9]').str.len()
Then I filter:
df[(df['number'] < 9) & (df['number'] > 6)]
Answer 0 (score: 1)
With a better re pattern, you can use Series.str.extract:
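import pandas as pd
from io import StringIO

# Columns are separated by two or more spaces so that sep=r'\s\s+' can split them.
df_str = '''id  Body
1  You 're get 4500000
2  Congrats, you receive 500000
3  Congrats, you receive 5.000.000
4  Congrats, you get 2.000.000,00!
5  Your verification code is 600700800
6  this line has no numbers
7  this line has malformed numbers 5.00,8
'''
df = pd.read_csv(StringIO(df_str), sep=r'\s\s+', engine='python', index_col=0)

# Digits, optional '.'-separated thousands groups, optional decimals after ','.
pattern = r'((?:\d+)(?:\.\d{3})*(?:,\d+)?)'
# Extract the first match per row; rows without a match become NaN.
numbers = df['Body'].str.extract(pattern, expand=False)
# Drop the thousands separators literally (regex=False), then turn the decimal comma into a dot.
number_floats = numbers.str.replace('.', '', regex=False).str.replace(',', '.', regex=False).astype(float)
in_range = (1E6 <= number_floats) & (number_floats <= 1E8)
# Keep in-range values; everything else (including NaN) becomes 0.
df['millions'] = number_floats.where(in_range, 0)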
id  Body                                     millions
1   You 're get 4500000                      4500000.0
2   Congrats, you receive 500000             0.0
3   Congrats, you receive 5.000.000          5000000.0
4   Congrats, you get 2.000.000,00!          2000000.0
5   Your verification code is 600700800      0.0
6   this line has no numbers                 0.0
7   this line has malformed numbers 5.00,8   0.0
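It only misbehaves when a single line contains more than one number, because str.extract returns just the first match.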
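One possible way around the multiple-numbers limitation is to collect every match in a line and keep the first one that falls in range. The following is a minimal sketch using only the standard re module; the helper name first_in_range and its low/high parameters are made up for illustration and are not part of the original answer.

```python
import re

# Same pattern as in the answer, minus the outer capturing group,
# so findall returns each whole number as a plain string.
PATTERN = re.compile(r'\d+(?:\.\d{3})*(?:,\d+)?')

def first_in_range(text, low=1e6, high=1e8):
    """Return the first number in text within [low, high], or 0.0 if none.

    Hypothetical helper: assumes '.' separates thousands and ',' decimals.
    """
    for token in PATTERN.findall(text):
        # Remove thousands separators, then convert the decimal comma.
        value = float(token.replace('.', '').replace(',', '.'))
        if low <= value <= high:
            return value
    return 0.0

print(first_in_range("Code 1234, you receive 5.000.000"))
```

With the DataFrame from the answer, this could be applied row by row as df['Body'].apply(first_in_range). It is slower than the vectorized str.extract approach, so it is probably only worth it if rows with several numbers actually occur.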
(
(?:\d+) # a number of digits
(?:\.\d{3})* # a `.` followed by a group of 3 digits; optional, multiple possible
(?:,\d+)? # a `,` followed by a number of digits; optional
)
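(?: opens a non-capturing group, so these subgroups are not captured separately; only the outer parentheses produce a capture.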
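To see what the non-capturing groups change, the small comparison below runs the pattern with and without (?: through re.findall. The sample text is taken from row 4 of the data; the variable names are only illustrative.

```python
import re

text = "Congrats, you get 2.000.000,00!"

# Non-capturing subgroups: only the outer group captures,
# so findall returns each whole number as one string.
non_capturing = re.findall(r'((?:\d+)(?:\.\d{3})*(?:,\d+)?)', text)

# Capturing subgroups: findall returns one tuple per match,
# with one entry per group (a repeated group keeps its last repetition).
capturing = re.findall(r'((\d+)(\.\d{3})*(,\d+)?)', text)

print(non_capturing)  # ['2.000.000,00']
print(capturing)      # [('2.000.000,00', '2', '.000', ',00')]
```

With capturing subgroups, Series.str.extract would likewise return one column per group instead of a single Series, which is why the answer keeps only the outer group capturing.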