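This is not a duplicate of another question, because I do not want to drop rows. The accepted answer in that post is very different from what I need and does not aim to preserve all the data.
Problem: delimiter inside the column data of a badly formatted csv file
Tried solutions: csv module, shlex, StringIO (no working solution found on SO)
Example data
The delimiter appears inside the third data field, which is partly enclosed by (multiple) double quotes: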
08884624;6/4/2016;Network routing 21,5\"\" 4;8GHz1TB hddQwerty\"\";9999;resell:no;package:1;test
0085658;6/4/2016;Logic 111BLACK.compat: 29,46 cm (11.6\"\")deep: 4;06 cm height: 25;9 cm\"\";9999;resell:no;package:1;test
4235846;6/4/2016;Case Logic. compat: 39,624 cm (15.6\"\") deep: 3;05 cm height: 3 cm\"\";9999;resell:no;package:1;test
400015;6/4/2016;Cable\"\"Easy Cover\"\"\"\";1;5 m 30 Silver\"\";9999;resell:no;package:1;test
9791118;6/4/2016;Network routing 21,5\"\" (2013) 2;7GHz\"\";9999;resell:no;package:1;test
477000;6/4/2016;iGlaze. deep: 9,6 mm (67.378\"\") height: 14;13 cm\"\";9999;resell:no;package:1;test
4024001;6/4/2016;DigitalBOX. tuner: Digital, Power: 20 W., Diag: 7,32 cm (2.88\"\"). Speed 10;100 Mbit/s\"\";9999;resell:no;package:1;test
Desired sample output
A list with a fixed length of 7:
['08884624','6/4/2016', 'Network routing 21,5\" 4,8GHz1TB hddQwerty', '9999', 'resell:no', 'package:1', 'test']
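Parsing with the csv reader does not solve the problem (skipinitialspace is not the issue), shlex is of no use, and StringIO does not help either...
My initial idea was to import the file line by line and replace the ';' inside the affected element of each line. But the import itself is the problem, because it splits on every ';'.
The data comes from a larger file with 300,000+ rows (not all rows have this problem). Any suggestions are welcome.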
Answer 0: (score: 2)
Since you know the number of input fields, and since only one field is malformed, you can simply split on ; and then merge the middle pieces back into a single field:
for line in file:
    temp_l = line.rstrip('\n').split(';')
    # the first 2 and the last 4 fields are well formed; everything in between belongs to the third field
    lst = temp_l[:2] + [';'.join(temp_l[2:-4])] + temp_l[-4:]  # lst should contain the expected fields
I did not even try to process the double quotes, because I could not understand how you get from Network routing 21,5\"\" 4;8GHz1TB hddQwerty\"\" to the third field of your desired output.
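The split-and-rejoin idea above can be sketched as a small self-contained function. The names split_fixed and normalize_field are hypothetical, as are the defaults (7 fields, malformed field at index 2), which are taken from the question. The quote handling in normalize_field is only a guess inferred from the single desired-output line in the question (drop a trailing "", collapse "" to ", turn the embedded ; into ,), so treat it as an assumption, not a confirmed rule.

```python
def split_fixed(line, n_fields=7, bad_index=2, sep=";"):
    """Split on sep and merge any surplus pieces back into field bad_index."""
    parts = line.rstrip("\r\n").split(sep)
    extra = len(parts) - n_fields          # number of separators inside the bad field
    if extra < 0:
        raise ValueError(f"expected at least {n_fields} fields, got {len(parts)}")
    head = parts[:bad_index]
    middle = sep.join(parts[bad_index:bad_index + extra + 1])
    tail = parts[bad_index + extra + 1:]
    return head + [middle] + tail


def normalize_field(field, sep=";"):
    """ASSUMPTION: cleanup rule guessed from the one desired-output example."""
    if field.endswith('""'):
        field = field[:-2]                 # drop the trailing pair of quotes
    field = field.replace('""', '"')       # collapse doubled quotes
    return field.replace(sep, ",")         # embedded separator becomes a comma


sample = ('08884624;6/4/2016;Network routing 21,5"" 4;8GHz1TB hddQwerty""'
          ';9999;resell:no;package:1;test\n')
fields = split_fixed(sample)
fields[2] = normalize_field(fields[2])
print(fields)
```

Rows that are already well formed pass through unchanged, because there are no surplus pieces to merge. A row with fewer than 7 pieces raises ValueError, so it can be logged instead of silently dropped.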
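Answer 1: (score: 0)
You can use the standard csv module.
To achieve what you are after, just change the csv delimiter to ';', as used in the question's data.
Test the following in a terminal: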
import csv
test = ["4024001;6/4/2016;DigitalBOX. tuner: Digital, Power: 20 W., Diag: 7,32 cm (2.88\"\"). Speed 10;100 Mbit/s\"\";9999;resell:no;package:1;test"]
delimited_colon = list(csv.reader(test, delimiter=";", skipinitialspace=True))
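Note that changing the delimiter alone still leaves the ';' inside the third field unresolved, so affected rows come back with more than 7 fields. A minimal sketch of how the csv module could be combined with a rejoin step follows. It assumes that exactly one field (index 2) can contain the delimiter and that every row should end up with 7 fields, as stated in the question. repaired_rows, EXPECTED_FIELDS and BAD_INDEX are hypothetical names, and quoting=csv.QUOTE_NONE is my choice so that the stray double quotes are kept as literal characters.

```python
import csv
import io

EXPECTED_FIELDS = 7   # from the question: fixed length of 7
BAD_INDEX = 2         # assumption: only the third field may contain ';'

raw = (
    '4024001;6/4/2016;DigitalBOX. tuner: Digital, Power: 20 W., '
    'Diag: 7,32 cm (2.88""). Speed 10;100 Mbit/s"";9999;resell:no;package:1;test\n'
)


def repaired_rows(lines):
    """Yield rows of EXPECTED_FIELDS fields, merging surplus pieces into BAD_INDEX."""
    reader = csv.reader(lines, delimiter=";", quoting=csv.QUOTE_NONE)
    for row in reader:
        extra = len(row) - EXPECTED_FIELDS
        if extra > 0:
            merged = ";".join(row[BAD_INDEX:BAD_INDEX + extra + 1])
            row = row[:BAD_INDEX] + [merged] + row[BAD_INDEX + extra + 1:]
        yield row


# io.StringIO stands in for the real file; an open file object works the same way.
rows = list(repaired_rows(io.StringIO(raw)))
print(len(rows[0]), rows[0][2])
```

Because the reader is consumed lazily, the same generator can be fed an open file with 300,000+ rows without loading everything into memory.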