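I'm working with a large dataset of about 10,500 lines, each of which needs to be split into separate parts: title, date, rating, and length. The data is formatted like this: Ghost Blues: The Story of Rory Gallagher (2010) | 3.8 stars, 1hr 21m
I've worked out how to use .split to break each line into two halves, but I'm not sure how to then split the first half into title and date, especially when the title itself contains parentheses, for example: Dhobi Ghat (Mumbai Diaries) (2010) | 3.6 stars, 1hr 42m
In some cases a field may be empty, so a line may have no rating, date, or length, which is also causing me problems. Can anyone point me in the right direction? Any help would be much appreciated!
Edit: I forgot to mention (sorry) that I need the dates and ratings as integers, because later I'll need to apply filters such as finding all entries with a rating > 3.5, or films made after 1998, that sort of thing. That complicates what I have working so far. Thanks for all the help so far!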
Answer 0 (score: 1)
Try this. I tested a few edge cases, as shown in the comments:
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // The class name Main is only a wrapper so the snippet compiles on its own.
    public class Main {
        public static void main(String[] args) {
            String s = "Ghost Blues: The Story of Rory Gallagher (2010) | 3.8 stars, 1hr 21m";
            //String s = "Ghost Blues: The Story of Rory Gallagher | 3.8 stars, 1hr 21m"; // no year
            //String s = "Ghost Blues: The Story of Rory Gallagher (2010) | 3.8 stars"; // no length
            // Groups: 1 = title, 4 = year, 5 = rating, 8 = length. The year and length parts are optional.
            Pattern p = Pattern.compile("(.*?)( (\\((\\d{4})\\)))? \\|\\s+(\\d(\\.\\d)?) stars(, (\\dhr( \\d{1,2}m)?))?");
            Matcher m = p.matcher(s);
            if (m.find()) {
                System.out.println(m.group(1)); // title
                System.out.println(m.group(4)); // year, or null if absent
                System.out.println(m.group(5)); // rating
                System.out.println(m.group(8)); // length, or null if absent
            }
        }
    }
If you can provide examples of edge cases, this can be improved further.
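The question's edit asks for the year and rating in numeric form so they can be filtered later. Below is a minimal sketch of one way to get there, not a definitive implementation. The names `MovieLineParser`, `Movie`, and `parse` are hypothetical and do not come from the thread. It assumes a missing field should become `null`, that a rating such as 3.8 is better held as a `Double` than as an integer, and that the length is most useful as total minutes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not from the thread): parses one line into typed fields.
final class MovieLineParser {
    // Title, then an optional trailing "(yyyy)", then an optional "| ..." remainder.
    private static final Pattern LINE =
            Pattern.compile("^(.*?)(?:\\s*\\((\\d{4})\\))?\\s*(?:\\|(.*))?$");
    private static final Pattern RATING = Pattern.compile("(\\d+(?:\\.\\d+)?)\\s*stars");
    private static final Pattern HOURS = Pattern.compile("(\\d+)\\s*hr");
    private static final Pattern MINUTES = Pattern.compile("(\\d+)\\s*m\\b");

    static final class Movie {
        final String title;
        final Integer year;     // null when the line has no "(yyyy)"
        final Double rating;    // null when the line has no "N stars"
        final Integer minutes;  // total running time in minutes, null when absent

        Movie(String title, Integer year, Double rating, Integer minutes) {
            this.title = title;
            this.year = year;
            this.rating = rating;
            this.minutes = minutes;
        }
    }

    static Movie parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unparseable line: " + line);
        }
        String title = m.group(1).trim();
        Integer year = m.group(2) == null ? null : Integer.valueOf(m.group(2));
        String rest = m.group(3) == null ? "" : m.group(3);

        // Rating and length are searched independently, so either may be missing.
        Matcher r = RATING.matcher(rest);
        Double rating = r.find() ? Double.valueOf(r.group(1)) : null;

        Matcher h = HOURS.matcher(rest);
        Matcher min = MINUTES.matcher(rest);
        boolean hasHours = h.find();
        boolean hasMinutes = min.find();
        Integer minutes = null;
        if (hasHours || hasMinutes) {
            minutes = (hasHours ? Integer.parseInt(h.group(1)) * 60 : 0)
                    + (hasMinutes ? Integer.parseInt(min.group(1)) : 0);
        }
        return new Movie(title, year, rating, minutes);
    }

    public static void main(String[] args) {
        Movie movie = parse("Dhobi Ghat (Mumbai Diaries) (2010) | 3.6 stars, 1hr 42m");
        // Prints: Dhobi Ghat (Mumbai Diaries) / 2010 / 3.6 / 102
        System.out.println(movie.title + " / " + movie.year + " / " + movie.rating + " / " + movie.minutes);
    }
}
```

Only a four-digit group in parentheses at the end of the left half is treated as the year, so parentheses inside the title are left alone. Callers need to check for `null` before comparing a field.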
Answer 1 (score: 0)
Here is a solution:
    public class Title {
        private String title;
        private String year;
        private String rating;
        private String length;

        public Title(String input) {
            // Left of the pipe: "title (year)"; right of the pipe: "rating, length".
            String[] leftRight = input.split("\\|");
            title = leftRight[0].trim();
            // Treat the last parenthesized group as the year only if it is
            // four digits at the very end, and drop the closing parenthesis.
            int lastParen = title.lastIndexOf("(");
            if (lastParen > 0 && title.substring(lastParen).matches("\\(\\d{4}\\)")) {
                year = title.substring(lastParen + 1, title.length() - 1);
                title = title.substring(0, lastParen).trim();
            }
            if (leftRight.length > 1) {
                String[] fields = leftRight[1].split(",");
                for (int i = 0; i < fields.length; i++) {
                    // A field containing "stars" is the rating; anything else is the length.
                    if (fields[i].contains("stars")) {
                        rating = fields[i].trim();
                    } else {
                        length = fields[i].trim();
                    }
                }
            }
        }

        @Override
        public String toString() {
            return "Title{" + "title=" + title + ", year=" + year + ", rating=" + rating + ", length=" + length + '}';
        }

        public static void main(String[] args) {
            String[] data = {
                "Ghost Blues: The Story of Rory Gallagher (2010) | 3.8 stars, 1hr 21m",
                "Dhobi Ghat (Mumbai Diaries) (2010) | 3.6 stars, 1hr 42m",
                "just a title",
                "title and rating only | 3.2 stars",
                "title and length only | 1hr 30m"
            };
            for (String titleString : data) {
                Title t = new Title(titleString);
                System.out.println(t);
            }
        }
    }

Here is the output for the test data:

    Title{title=Ghost Blues: The Story of Rory Gallagher, year=2010, rating=3.8 stars, length=1hr 21m}
    Title{title=Dhobi Ghat (Mumbai Diaries), year=2010, rating=3.6 stars, length=1hr 42m}
    Title{title=just a title, year=null, rating=null, length=null}
    Title{title=title and rating only, year=null, rating=3.2 stars, length=null}
    Title{title=title and length only, year=null, rating=null, length=1hr 30m}
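For the filters mentioned in the question's edit (rating > 3.5, films after 1998), the year and rating need to be stored as numbers, not strings. The following is a rough sketch of such filtering with Java streams. `MovieFilterSketch`, `Entry`, `titlesRatedAbove`, and `titlesReleasedAfter` are made-up names for illustration, and it assumes that entries with a missing year or rating should simply be excluded from the corresponding filter.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: filtering entries whose year and rating are already numeric.
final class MovieFilterSketch {
    static final class Entry {
        final String title;
        final Integer year;   // null when unknown
        final Double rating;  // null when unknown

        Entry(String title, Integer year, Double rating) {
            this.title = title;
            this.year = year;
            this.rating = rating;
        }
    }

    // Titles of entries with a known rating strictly above minRating.
    static List<String> titlesRatedAbove(List<Entry> entries, double minRating) {
        return entries.stream()
                .filter(e -> e.rating != null && e.rating > minRating)
                .map(e -> e.title)
                .collect(Collectors.toList());
    }

    // Titles of entries with a known year strictly after the given year.
    static List<String> titlesReleasedAfter(List<Entry> entries, int year) {
        return entries.stream()
                .filter(e -> e.year != null && e.year > year)
                .map(e -> e.title)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Entry> entries = Arrays.asList(
                new Entry("Ghost Blues: The Story of Rory Gallagher", 2010, 3.8),
                new Entry("Dhobi Ghat (Mumbai Diaries)", 2010, 3.6),
                new Entry("title and rating only", null, 3.2),
                new Entry("just a title", null, null));
        System.out.println(titlesRatedAbove(entries, 3.5));
        System.out.println(titlesReleasedAfter(entries, 1998));
    }
}
```

The null checks come first in each predicate so that an entry with a missing field is skipped and does not cause a `NullPointerException` during unboxing.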