I have a requirement where users will upload a CSV file in the format below. The file will contain roughly 1.8 to 2 million records.
SITE_ID,HOUSE,STREET,CITY,STATE,ZIP,APARTMENT
44,545395,PORT ROYAL,CORPUS CHRISTI,TX,78418,2
44,608646,TEXAS AVE,ODESSA,TX,79762,
44,487460,EVERHART RD,CORPUS CHRISTI,TX,78413,
44,275543,EDWARD GARY,SAN MARCOS,TX,78666,4
44,136811,MAGNOLIA AVE,SAN ANTONIO,TX,78212
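What I need to do is validate the file first and then, if validation succeeds with no errors, save it to the database. The validations I have to apply are different for each column. For example: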
SITE_ID: it can only be an integer and it is required.
HOUSE: integer, required
STREET: alphanumeric, required
CITY: alphabets only, required
State: 2 alphabets only, required
zip: 5 digits only, required
APARTMENT: integer only, optional
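I need a generic way to apply these validations to the individual columns. What I have tried so far is converting the CSV file to a DataTable, and I planned to validate each cell with a regex, but that doesn't seem like a generic or good solution to me. Can anyone help me with this and point me in the right direction?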
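Answer 0 (score: 2)
Here is a rather overengineered but really fun generic method, where you give your class properties attributes that match them to the CSV column headers: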
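The first step is to parse your CSV. There are a variety of methods out there, but my favourite is the TextFieldParser that can be found in the Microsoft.VisualBasic.FileIO namespace. The advantage of using it is that it's 100% native; all you need to do is add Microsoft.VisualBasic to the references.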
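The SplitFile function called further down is not shown in this answer, so here is a minimal sketch of what it could look like. The names CsvSplitter and SplitText are made up for illustration, and the settings (quoted fields enabled, a single separator) are assumptions rather than the answer's actual code.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.VisualBasic.FileIO;

public static class CsvSplitter
{
    // Hypothetical stand-in for the SplitFile(...) helper used later in this answer.
    public static List<String[]> SplitFile(String path, Encoding encoding, Char separator)
    {
        using (StreamReader reader = new StreamReader(path, encoding))
            return SplitText(reader, separator);
    }

    // Reads every row from any TextReader into memory as an array of fields.
    public static List<String[]> SplitText(TextReader reader, Char separator)
    {
        List<String[]> rows = new List<String[]>();
        using (TextFieldParser parser = new TextFieldParser(reader))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(separator.ToString());
            // Assumption: fields may be wrapped in double quotes.
            parser.HasFieldsEnclosedInQuotes = true;
            while (!parser.EndOfData)
                rows.Add(parser.ReadFields());
        }
        return rows;
    }
}
```

Note that ReadFields throws a MalformedLineException on lines it cannot parse, so for user-uploaded files you would probably want to catch that and report the line.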
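Having done that, you have the data as a List<String[]>. Now things get interesting. See, now we can create a custom attribute and add it to our class properties:
The attribute class: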
[AttributeUsage(AttributeTargets.Property)]
public sealed class CsvColumnAttribute : System.Attribute
{
public String Name { get; private set; }
public Regex ValidationRegex { get; private set; }
public CsvColumnAttribute(String name) : this(name, null) { }
public CsvColumnAttribute(String name, String validationRegex)
{
this.Name = name;
this.ValidationRegex = new Regex(validationRegex ?? "^.*$");
}
}
The data class:
public class AddressInfo
{
[CsvColumnAttribute("SITE_ID", "^\\d+$")]
public Int32 SiteId { get; set; }
[CsvColumnAttribute("HOUSE", "^\\d+$")]
public Int32 House { get; set; }
[CsvColumnAttribute("STREET", "^[a-zA-Z0-9- ]+$")]
public String Street { get; set; }
[CsvColumnAttribute("CITY", "^[a-zA-Z -]+$")]
public String City { get; set; }
[CsvColumnAttribute("STATE", "^[a-zA-Z]{2}$")]
public String State { get; set; }
[CsvColumnAttribute("ZIP", "^\\d{5}$")]
public Int32 Zip { get; set; }
[CsvColumnAttribute("APARTMENT", "^\\d*$")]
public Int32? Apartment { get; set; }
}
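As you can see, what I did here was link each property to a CSV column name and give it a regex to validate the contents. For non-required values you can still use a regex, just one that allows an empty value, as shown on the Apartment property.
Now, to actually match the columns to the CSV headers, we need to get the properties of the AddressInfo class, check each property for a CsvColumnAttribute, and if it has one, match its name to the column headers of the CSV file data. Once we have that, we have a list of PropertyInfo objects, which can be used to dynamically fill in the properties of the new objects created for all rows.
This method is completely generic, allows the columns to appear in any order in the CSV file, and the parsing will work for any class once you assign the CsvColumnAttribute to the properties you want to fill in. It validates the data automatically, and you can handle failures however you want. In this code, all I do is skip invalid lines.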
public static List<T> ParseCsvInfo<T>(List<String[]> split) where T : new()
{
// No template row, or only a template row but no data. Abort.
if (split.Count < 2)
return new List<T>();
String[] templateRow = split[0];
// Create a dictionary of column headers and their index in the file data.
Dictionary<String, Int32> columnIndexing = new Dictionary<String, Int32>();
for (Int32 i = 0; i < templateRow.Length; i++)
{
// ToUpperInvariant is optional, of course. You could have case sensitive headers.
String colHeader = templateRow[i].Trim().ToUpperInvariant();
if (!columnIndexing.ContainsKey(colHeader))
columnIndexing.Add(colHeader, i);
}
// Prepare the arrays of property parse info. We set the length
// so the highest found column index exists in it.
Int32 numCols = columnIndexing.Values.Max() + 1;
// Actual property to fill in
PropertyInfo[] properties = new PropertyInfo[numCols];
// Regex to validate the string before parsing
Regex[] propValidators = new Regex[numCols];
// Type converters for automatic parsing
TypeConverter[] propconverters = new TypeConverter[numCols];
// go over the properties of the given type, see which ones have a
// CsvColumnAttribute, and put these in the list at their CSV index.
foreach (PropertyInfo p in typeof(T).GetProperties())
{
object[] attrs = p.GetCustomAttributes(true);
foreach (Object attr in attrs)
{
CsvColumnAttribute csvAttr = attr as CsvColumnAttribute;
if (csvAttr == null)
continue;
Int32 index;
if (!columnIndexing.TryGetValue(csvAttr.Name.ToUpperInvariant(), out index))
{
// If no valid column is found, and the regex for this property
// does not allow an empty value, then all lines are invalid.
if (!csvAttr.ValidationRegex.IsMatch(String.Empty))
return new List<T>();
// No valid column found: ignore this property.
break;
}
properties[index] = p;
propValidators[index] = csvAttr.ValidationRegex;
// Automatic type converter. This function could be enhanced by giving a
// list of custom converters as extra argument and checking those first.
propconverters[index] = TypeDescriptor.GetConverter(p.PropertyType);
break; // Only handle one CsvColumnAttribute per property.
}
}
List<T> objList = new List<T>();
// start from 1 since the first line is the template with the column names
for (Int32 i = 1; i < split.Count; i++)
{
Boolean abortLine = false;
String[] line = split[i];
// make new object of the given type
T obj = new T();
for (Int32 col = 0; col < properties.Length; col++)
{
// It is possible a line is not long enough to contain all columns.
String curVal = col < line.Length ? line[col] : String.Empty;
PropertyInfo prop = properties[col];
// this can be null if the column was not found but wasn't required.
if (prop == null)
continue;
// check validity. Abort buildup of this object if not valid.
Boolean valid = propValidators[col].IsMatch(curVal);
if (!valid)
{
// Add logging here? We have the line and column index.
abortLine = true;
break;
}
// Automated parsing. Always use nullable types for nullable properties.
Object value = propconverters[col].ConvertFromString(curVal);
prop.SetValue(obj, value, null);
}
if (!abortLine)
objList.Add(obj);
}
return objList;
}
To use it on your CSV file, simply do:
// the function using VB's TextFieldParser
List<String[]> splitData = SplitFile(datafile, new UTF8Encoding(false), ',');
// The above function, applied to the AddressInfo class
List<AddressInfo> addresses = ParseCsvInfo<AddressInfo>(splitData);
And that's it. Automatic parsing and validation, all through a few attributes added to the class properties.
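Since the question wants the whole file validated before anything is saved, skipping invalid lines may not be enough. Below is a rough sketch of collecting the failures instead, at the spot where the parsing code has the "Add logging here?" comment. CsvCellError and CsvErrorCollector are hypothetical names, and the validators array is assumed to be built the same way as propValidators above (indexed by CSV column, null where no rule applies).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical record of one failed cell.
public sealed class CsvCellError
{
    public Int32 Row { get; private set; }
    public Int32 Column { get; private set; }
    public String Value { get; private set; }

    public CsvCellError(Int32 row, Int32 column, String value)
    {
        this.Row = row;
        this.Column = column;
        this.Value = value;
    }
}

public static class CsvErrorCollector
{
    // Checks every data row (index 0 is the header row) against the
    // per-column validators and returns all failing cells.
    public static List<CsvCellError> Collect(List<String[]> split, Regex[] validators)
    {
        List<CsvCellError> errors = new List<CsvCellError>();
        for (Int32 i = 1; i < split.Count; i++)
        {
            String[] line = split[i];
            for (Int32 col = 0; col < validators.Length; col++)
            {
                if (validators[col] == null)
                    continue; // no rule for this column
                // A short line counts as having empty values for the missing columns.
                String curVal = col < line.Length ? line[col] : String.Empty;
                if (!validators[col].IsMatch(curVal))
                    errors.Add(new CsvCellError(i, col, curVal));
            }
        }
        return errors;
    }
}
```

If the returned list is empty the file can go to the database; otherwise the row and column indices can be shown to the user.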
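Note that if splitting the data in advance would cause too much of a performance hit for large files, that's not really a problem. TextFieldParser works from a Stream wrapped in a TextReader, so instead of passing a List<String[]> you can just pass a stream and do the CSV parsing on the fly inside the ParseCsvInfo function, simply reading each CSV line directly from the TextFieldParser.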
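I didn't do that here because the original use case for which I wrote the reader to List<String[]> included automatic encoding detection, which required reading the whole file anyway.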
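For the streaming variant, one possible starting point is a lazy row reader like the sketch below. CsvStreaming and ReadRows are made-up names, and ParseCsvInfo as posted would still need to be changed to accept an IEnumerable<String[]> and take its header from the first row yielded.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class CsvStreaming
{
    // Yields one row at a time, so only the current line is held in memory.
    public static IEnumerable<String[]> ReadRows(TextReader reader, Char separator)
    {
        using (TextFieldParser parser = new TextFieldParser(reader))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(separator.ToString());
            while (!parser.EndOfData)
                yield return parser.ReadFields();
        }
    }
}
```

With around 2 million records this avoids building the full List<String[]> before validation starts.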
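Answer 1 (score: 1)
I would suggest using a CSV library to read the file. For example, you can use LumenWorksCsvReader: https://www.nuget.org/packages/LumenWorksCsvReader
Your approach of validating with regexes is actually fine. For example, you could create a "validation dictionary" and check every CSV value against its regex.
Then you can build a function that validates a CSV file using such a "validation dictionary".
See here: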
string lsInput = @"SITE_ID,HOUSE,STREET,CITY,STATE,ZIP,APARTMENT
44,545395,PORT ROYAL,CORPUS CHRISTI,TX,78418,2
44,608646,TEXAS AVE,ODESSA,TX,79762,
44,487460,EVERHART RD,CORPUS CHRISTI,TX,78413,
44,275543,EDWARD GARY,SAN MARCOS,TX,78666,4
44,136811,MAGNOLIA AVE,SAN ANTONIO,TX,78212";
Dictionary<string, string> loValidations = new Dictionary<string, string>();
loValidations.Add("SITE_ID", @"^\d+$"); //it can only be an integer and it is required.
//....
bool lbValid = true;
using (CsvReader loCsvReader = new CsvReader(new StringReader(lsInput), true, ','))
{
while (loCsvReader.ReadNextRecord())
{
foreach (var loValidationEntry in loValidations)
{
if (!Regex.IsMatch(loCsvReader[loValidationEntry.Key], loValidationEntry.Value))
{
lbValid = false;
break;
}
}
if (!lbValid)
break;
}
}
Console.WriteLine($"Valid: {lbValid}");
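The snippet above only fills in the SITE_ID rule. A possible complete dictionary for the rules listed in the question could look like the sketch below. The patterns are my reading of those rules (for example, spaces are allowed in STREET and CITY because the sample data contains them), and AddressValidation and FindInvalidColumns are hypothetical names. It works on plain string arrays so it does not depend on any particular CSV reader.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AddressValidation
{
    // One anchored regex per column header; assumed patterns, adjust as needed.
    public static readonly Dictionary<string, Regex> Rules = new Dictionary<string, Regex>
    {
        { "SITE_ID",   new Regex(@"^\d+$") },           // integer, required
        { "HOUSE",     new Regex(@"^\d+$") },           // integer, required
        { "STREET",    new Regex(@"^[A-Za-z0-9 ]+$") }, // alphanumeric, required
        { "CITY",      new Regex(@"^[A-Za-z ]+$") },    // letters only, required
        { "STATE",     new Regex(@"^[A-Za-z]{2}$") },   // exactly 2 letters, required
        { "ZIP",       new Regex(@"^\d{5}$") },         // exactly 5 digits, required
        { "APARTMENT", new Regex(@"^\d*$") },           // integer, optional (may be empty)
    };

    // Returns the headers of all columns whose value fails its rule.
    public static List<string> FindInvalidColumns(string[] headers, string[] fields)
    {
        var invalid = new List<string>();
        for (int i = 0; i < headers.Length; i++)
        {
            Regex rule;
            if (!Rules.TryGetValue(headers[i], out rule))
                continue; // no rule defined for this column
            // A missing trailing field is treated as empty.
            string value = i < fields.Length ? fields[i] : string.Empty;
            if (!rule.IsMatch(value))
                invalid.Add(headers[i]);
        }
        return invalid;
    }
}
```

Returning the failing column names rather than a single bool makes it easier to tell the user what is wrong with a given row.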
Answer 2 (score: 1)
Here is one efficient method:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Data;
using System.Data.OleDb;
using System.Text.RegularExpressions;
using System.IO;
namespace ConsoleApplication23
{
class Program
{
const string FILENAME = @"c:\temp\test.csv";
static void Main(string[] args)
{
CSVReader csvReader = new CSVReader();
DataSet ds = csvReader.ReadCSVFile(FILENAME, true);
RegexCompare compare = new RegexCompare();
DataTable errors = compare.Get_Error_Rows(ds.Tables[0]);
}
}
class RegexCompare
{
public static Dictionary<string,RegexCompare> dict = new Dictionary<string,RegexCompare>() {
{ "SITE_ID", new RegexCompare() { columnName = "SITE_ID", pattern = @"[^\d]+", positveNegative = false, required = true}},
{ "HOUSE", new RegexCompare() { columnName = "HOUSE", pattern = @"[^\d]+", positveNegative = false, required = true}},
{ "STREET", new RegexCompare() { columnName = "STREET", pattern = @"[A-Za-z0-9 ]+", positveNegative = true, required = true}},
{ "CITY", new RegexCompare() { columnName = "CITY", pattern = @"[A-Za-z ]+", positveNegative = true, required = true}},
{ "STATE", new RegexCompare() { columnName = "STATE", pattern = @"[A-Za-z]{2}", positveNegative = true, required = true}},
{ "ZIP", new RegexCompare() { columnName = "ZIP", pattern = @"\d{5}", positveNegative = true, required = true}},
{ "APARTMENT", new RegexCompare() { columnName = "APARTMENT", pattern = @"\d*", positveNegative = true, required = false}},
};
string columnName { get; set;}
string pattern { get; set; }
Boolean positveNegative { get; set; }
Boolean required { get; set; }
public DataTable Get_Error_Rows(DataTable dt)
{
DataTable dtError = null;
foreach (DataRow row in dt.AsEnumerable())
{
Boolean error = false;
foreach (DataColumn col in dt.Columns)
{
RegexCompare regexCompare = dict[col.ColumnName];
object colValue = row.Field<object>(col.ColumnName);
if (regexCompare.required)
{
if (colValue == null)
{
error = true;
break;
}
}
else
{
if (colValue == null)
continue;
}
string colValueStr = colValue.ToString();
Match match = Regex.Match(colValueStr, regexCompare.pattern);
if (regexCompare.positveNegative)
{
if (!match.Success)
{
error = true;
break;
}
if (colValueStr.Length != match.Value.Length)
{
error = true;
break;
}
}
else
{
if (match.Success)
{
error = true;
break;
}
}
}
if(error)
{
if (dtError == null) dtError = dt.Clone();
dtError.Rows.Add(row.ItemArray);
}
}
return dtError;
}
}
public class CSVReader
{
public DataSet ReadCSVFile(string fullPath, bool headerRow)
{
string path = fullPath.Substring(0, fullPath.LastIndexOf("\\") + 1);
string filename = fullPath.Substring(fullPath.LastIndexOf("\\") + 1);
DataSet ds = new DataSet();
try
{
if (File.Exists(fullPath))
{
string ConStr = string.Format("Provider=Microsoft.Jet.OLEDB.4.0;Data Source={0}" + ";Extended Properties=\"Text;HDR={1};FMT=Delimited\\\"", path, headerRow ? "Yes" : "No");
string SQL = string.Format("SELECT * FROM {0}", filename);
OleDbDataAdapter adapter = new OleDbDataAdapter(SQL, ConStr);
adapter.Fill(ds, "TextFile");
ds.Tables[0].TableName = "Table1";
}
foreach (DataColumn col in ds.Tables["Table1"].Columns)
{
col.ColumnName = col.ColumnName.Replace(" ", "_");
}
}
catch (Exception ex)
{
Console.WriteLine(ex.Message);
}
return ds;
}
}
}
Answer 3 (score: 0)
Here is another approach to accomplish your needs, using Cinchoo ETL, an open source file helper library.
First define a POCO class with DataAnnotations validation attributes, as below:
public class Site
{
[Required(ErrorMessage = "SiteID can't be null")]
public int SiteID { get; set; }
[Required]
public int House { get; set; }
[Required]
public string Street { get; set; }
[Required]
[RegularExpression("^[a-zA-Z][a-zA-Z ]*$")]
public string City { get; set; }
[Required(ErrorMessage = "State is required")]
[RegularExpression("^[A-Z][A-Z]$", ErrorMessage = "Incorrect state code.")]
public string State { get; set; }
[Required]
[RegularExpression("^[0-9][0-9]*$")]
public string Zip { get; set; }
public int Apartment { get; set; }
}
Then use this class with ChoCSVReader to load the file and check its validity using the Validate() / IsValid() methods.
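Independently of the library, DataAnnotations attributes like these can also be checked with the framework's own Validator class. Here is a minimal sketch of that. SiteRow and AnnotationCheck are hypothetical names with only two properties for brevity, and this is not how Cinchoo ETL itself is invoked.

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel.DataAnnotations;

// Hypothetical, trimmed-down row class for illustration only.
public class SiteRow
{
    [Required(ErrorMessage = "State is required")]
    [RegularExpression("^[A-Za-z]{2}$", ErrorMessage = "State must be two letters.")]
    public string State { get; set; }

    [Required(ErrorMessage = "Zip is required")]
    [RegularExpression("^[0-9]{5}$", ErrorMessage = "Zip must be five digits.")]
    public string Zip { get; set; }
}

public static class AnnotationCheck
{
    // Runs all validation attributes on the object and returns the failures.
    public static List<ValidationResult> Validate(object row)
    {
        var results = new List<ValidationResult>();
        // The last argument (true) makes the validator check every property
        // attribute, not only [Required].
        Validator.TryValidateObject(row, new ValidationContext(row), results, true);
        return results;
    }
}
```

An empty result list means the row passed; each ValidationResult carries the ErrorMessage and the member names that failed.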
Hope it helps.
Disclaimer: I'm the author of this library.