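I am reading a text file that contains movie titles, years, languages, and so on, and I am trying to extract those attributes.
Suppose some of the strings look like the samples in the edit below.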
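How can I extract the title, the year, the country and, where it is specified, the type of title?
I am not good with regular expressions and patterns, and I do not know how to find an attribute when it is not specified. I am doing this because I am trying to generate XML from the text file. I have its DTD, but I am not sure whether I need to use it in this case.
Edit: here is what I have tried.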
String s = "\"A Fatal Inversion\" (1992)";
String d = "(aka \"Verhngnisvolles Erbe\" (1992)) (Germany)";
String f = "\"#Yaprava\" (2013) ";
String g = "(aka \"Love Heritage\" (2002)) (International: English title)";
Answer 0 (score: 1)
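I suggest extracting the year first, since that seems fairly consistent. Then I would extract the country (if there is one) and assume that whatever remains is the title.
To extract the country, I suggest hard-coding the names of the known countries into a regular expression. It may take a few iterations to work out what these are, since they seem to be very inconsistent.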
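One way to keep that hard-coded list manageable is to build the alternation from a plain list of names rather than typing the regex by hand. This is only a sketch: the class name `CountryPattern`, the method `findCountry`, and the starter list are made up for illustration, and it assumes Java 9+ for `List.of`.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class CountryPattern {
    // Hypothetical starter list; extend it as new bracketed values turn up in the data.
    private static final List<String> KNOWN = List.of("Germany", "France", "USA");

    // Pattern.quote escapes any regex metacharacters that a name might contain.
    private static final Pattern COUNTRY = Pattern.compile(
            "\\((" + KNOWN.stream().map(Pattern::quote).collect(Collectors.joining("|")) + ")\\)");

    // Returns the first known country found in brackets, or "" if there is none.
    public static String findCountry(String s) {
        Matcher m = COUNTRY.matcher(s);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        // prints "Germany"
        System.out.println(findCountry("(aka \"Verhngnisvolles Erbe\" (1992)) (Germany)"));
        // prints an empty line, because no known country is present
        System.out.println(findCountry("\"#Yaprava\" (2013) "));
    }
}
```

Adding a newly discovered value then means appending one string to the list instead of editing the pattern.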
The following code is a bit ugly (but so is the data!):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Extraction {
    public final String original;
    public String year = "";
    public String title = "";
    public String country = "";
    private String remaining;

    public Extraction(String s) {
        this.original = s;
        this.remaining = s;
        extractBracketedYear();
        extractBracketedCountry();
        // Whatever is left once year and country are removed is treated as the title.
        this.title = remaining.trim();
    }

    // Finds a bracketed number such as "(1992)", stores it and removes it from the text.
    private void extractBracketedYear() {
        Matcher matcher = Pattern.compile(" ?\\(([0-9]+)\\) ?").matcher(remaining);
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            this.year = matcher.group(1);
            matcher.appendReplacement(sb, "");
        }
        matcher.appendTail(sb);
        remaining = sb.toString();
    }

    // Finds a bracketed known country (hard-coded list), stores it and removes it from the text.
    private void extractBracketedCountry() {
        Matcher matcher = Pattern.compile("\\((Germany|International: English.*?)\\)").matcher(remaining);
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            this.country = matcher.group(1);
            matcher.appendReplacement(sb, "");
        }
        matcher.appendTail(sb);
        remaining = sb.toString();
    }

    public static void main(String... args) {
        for (String s : new String[] {
                "A Fatal Inversion (1992)",
                "(aka \"Verhngnisvolles Erbe\" (1992)) (Germany)",
                "\"#Yaprava\" (2013) ",
                "(aka \"Love Heritage\" (2002)) (International: English title)"}) {
            Extraction extraction = new Extraction(s);
            System.out.println("title = " + extraction.title);
            System.out.println("country = " + extraction.country);
            System.out.println("year = " + extraction.year);
            System.out.println();
        }
    }
}
Output:
title = A Fatal Inversion
country =
year = 1992

title = (aka "Verhngnisvolles Erbe")
country = Germany
year = 1992

title = "#Yaprava"
country =
year = 2013

title = (aka "Love Heritage")
country = International: English title
year = 2002
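The titles in the output above still carry the `(aka ...)` wrapper and the surrounding double quotes. If you want the bare title plus a flag saying whether it was an alternative title, something along these lines might do. It is a sketch only: the class `TitleCleanup` and its fields are hypothetical, and treating the `aka` wrapper as the marker for an alternative title is my assumption about the data.

```java
public class TitleCleanup {
    public final String title;
    public final boolean alias; // true when the entry was wrapped in "(aka ...)"

    public TitleCleanup(String raw) {
        String t = raw.trim();
        // Assumption: alternative titles always look like (aka ...).
        boolean aka = t.startsWith("(aka ") && t.endsWith(")");
        if (aka) {
            // Drop the leading "(aka " (5 characters) and the closing bracket.
            t = t.substring(5, t.length() - 1).trim();
        }
        // Strip one pair of surrounding double quotes, if present.
        if (t.length() >= 2 && t.startsWith("\"") && t.endsWith("\"")) {
            t = t.substring(1, t.length() - 1);
        }
        this.title = t;
        this.alias = aka;
    }

    public static void main(String[] args) {
        TitleCleanup a = new TitleCleanup("(aka \"Love Heritage\") ");
        System.out.println(a.title + " / alias=" + a.alias); // prints "Love Heritage / alias=true"
        TitleCleanup b = new TitleCleanup("\"#Yaprava\"");
        System.out.println(b.title + " / alias=" + b.alias); // prints "#Yaprava / alias=false"
    }
}
```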
Once you have this data, you can manipulate it further (e.g. "International: English title" -> "England").
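That kind of follow-up mapping could be a plain lookup table. A minimal sketch, assuming a hypothetical `CountryNormalizer` class; the only entry comes from the example above, and values without an entry are passed through unchanged.

```java
import java.util.HashMap;
import java.util.Map;

public class CountryNormalizer {
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        // Mapping taken from the example above; add more entries as the data requires.
        ALIASES.put("International: English title", "England");
    }

    // Returns the mapped value, or the raw value itself when no mapping is known.
    public static String normalize(String raw) {
        return ALIASES.getOrDefault(raw, raw);
    }

    public static void main(String[] args) {
        System.out.println(normalize("International: English title")); // prints "England"
        System.out.println(normalize("Germany"));                      // prints "Germany"
    }
}
```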