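I wrote a small program that uses regular expressions to detect data types. I have worked on this project before and got a lot of help from this great community. I want to reuse the code in my current project, but I've run into a problem identifying floats correctly.
The goal of the code is to read a CSV in as strings, identify the data type of each column, and then convert each column to that data type. The sample CSV I'm testing with is shown at the end of this question.
My code: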
import pandas as pd
import numpy as np
from tabulate import tabulate
from datetime import datetime
from io import StringIO  # pandas.compat.StringIO was removed in pandas 1.0
import re
df = pd.read_csv(pathname, dtype=str)
df = df.reset_index()
del df['index']
lst = list(df.columns.values)
numrows = df.shape[0]
numcols = df.shape[1]
col = 0
row = 0
date_count = []
int_count = []
str_count = []
boolean_count = []
float_count = []
time_count = []
dict = {}
keys = []
vals = []
variable_1 = 0
while col < numcols:
    while row < numrows:
        var2 = str(df.iloc[row, col])  # df.ix was removed in pandas 1.0
        # How to match all the data types:
        str_pattern = re.findall(r'\b\w+\b', var2)
        str_count = str_count + [str_pattern]
        int_pattern = re.findall(r'(?:\s|^)(\d+)(?:\s|$)', var2)
        int_count = int_count + [int_pattern]
        float_pattern = re.findall(r'^\d+\.\d+$', var2)
        float_count = float_count + [float_pattern]
        #boolean_pattern = re.findall(r'TRUE|FALSE|True|False|true|false|t|f|T|F', var2)
        boolean_pattern = re.findall(r'^TRUE$|^FALSE$|^True$|^False$|^true$|^false$|^t$|^f$|^T$|^F$', var2)
        boolean_count = boolean_count + [boolean_pattern]
        date_pattern = re.findall(r'(\d\d?|[a-zA-Z]{2,8})([:/-])(\d\d?)\2(\d{2,4})', var2)
        date_count = date_count + [date_pattern]
        time_pattern = re.findall(r'(\d{1,2})(?:[\:]{1})(\d{1,2})(?:[\:]{1})(\d{1,2})', var2)
        time_count = time_count + [time_pattern]
        # How to clear out all the empty values in the array
        str_count = [x for x in str_count if x != []]
        int_count = [x for x in int_count if x != []]
        float_count = [x for x in float_count if x != []]
        boolean_count = [x for x in boolean_count if x != []]
        date_count = [x for x in date_count if x != []]
        row = row + 1
    # Changing the column data types
    if len(int_count) == len(str_count):
        df[lst[col]] = pd.to_numeric(df[lst[col]], errors='coerce', downcast='integer')
    if len(float_count) == len(str_count):
        df[lst[col]] = pd.to_numeric(df[lst[col]], errors='coerce', downcast='float')
    if len(boolean_count) == len(str_count):
        df[lst[col]] = df[lst[col]].astype('bool')
    if len(date_count) == len(str_count):
        df[lst[col]] = pd.to_datetime(df[lst[col]], errors='coerce')
    del str_count[:]
    del int_count[:]
    del float_count[:]
    del boolean_count[:]
    del date_count[:]
    # Converting any column that has type object into a string
    # (np.object was removed in NumPy 1.24; the builtin object is equivalent)
    df.update(df.select_dtypes(include=[object]).astype(str))
    col = col + 1
    row = 0
# Creating Key to create dictionary
keys = list(df.columns.values)
print(df.dtypes)
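Output:
When I run this code with the sample CSV's path passed to read_csv, everything runs without errors, but for some reason the "Address" column comes back as a float type. I went to regex101.com and experimented with my regular expressions, and they behave correctly there.
Any help would be great!
Here is the sample data: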
Date,Name,Address,Age,Married
10/10/10,Alice,123 Main Street,21,FALSE
12/12/12,Bob,830 East Jefferson Street,30,TRUE
11/11/11,Rohin,6616 Majestic Way,21,FALSE
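Answer 0 (score: 0)
It isn't recognizing Address as a float; it is recognizing it as an integer. The conversion in to_numeric then fails, so downcast is ignored. Try this: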
pd.to_numeric(df['Address'], errors='coerce', downcast='integer')
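You'll see that it returns a column of NaN with dtype float64. Every address is being matched as an integer because each one contains an integer, and your integer regex matches it because the number is delimited by whitespace. Without errors='coerce', you would probably have seen what was going on.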
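To make the effect of errors='coerce' concrete, here is a minimal sketch that rebuilds only the Address column from the sample data as a Series (the variable names address and coerced are mine, not from the question) and runs the same conversion with and without coercion:

```python
import pandas as pd

# The Address values from the sample CSV in the question.
address = pd.Series(
    ["123 Main Street", "830 East Jefferson Street", "6616 Majestic Way"],
    name="Address",
)

# errors='coerce' silently turns every unparseable value into NaN,
# so the result is a float column and downcast='integer' has nothing to do.
coerced = pd.to_numeric(address, errors="coerce", downcast="integer")
print(coerced.dtype)
print(coerced.isna().all())

# With the default errors='raise', the same call fails loudly instead.
try:
    pd.to_numeric(address, downcast="integer")
except ValueError as exc:
    print("to_numeric raised:", exc)
```

Dropping errors='coerce' while debugging is one way to surface this kind of mismatch between the regex check and the actual conversion.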
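Edit: To clarify a bit, what you need to do is change your integer regex so that it matches only when the entire field is an integer with optional leading or trailing whitespace.
re.findall(r'(^\s?\d+\s?$)', var2)
This will match '123' or ' 123 ' but not '123 Main Street'.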
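The difference between the two integer patterns can be checked with the standard re module alone. This is a small sketch; the names LOOSE_INT, STRICT_INT and samples are illustrative and do not appear in the question:

```python
import re

# Integer pattern from the question: digits bounded by whitespace or a string edge.
LOOSE_INT = re.compile(r'(?:\s|^)(\d+)(?:\s|$)')
# Anchored pattern from the answer: the whole field must be an integer.
STRICT_INT = re.compile(r'(^\s?\d+\s?$)')

samples = ['123', ' 123 ', '123 Main Street', '21']

# True where findall returns at least one match for the field.
loose = {s: bool(LOOSE_INT.findall(s)) for s in samples}
strict = {s: bool(STRICT_INT.findall(s)) for s in samples}

for s in samples:
    print(repr(s), 'loose:', loose[s], 'strict:', strict[s])
```

Only the loose pattern accepts '123 Main Street', which is why the Address column passed the integer check. re.fullmatch(r'\s?\d+\s?', value) expresses the same whole-field requirement without explicit anchors.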