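I'm writing a program that pulls data out of the lines of a text file. The problem is that the file is not well formed, and that has caused a lot of confusion while trying to write a parser for it.
Here are two such lines. I can get the address and the latitude and longitude variables, but for the second line I can't get the price or the size. The error I keep getting is "String index out of range: -41" (SEVERE).
|12091805|,|0|,|DETAILS|,||,||,|Latitude:54.593406, Longitude:-5.934344 <b >Unit 8 Great Northern Mall Great Victoria Street Belfast Down<//b><p><p><p>Price : 150,000<p>Size: 2,411 Sq Feet ()<p>Rent : 50,500 Per Annum<p><p>Text<p><p>|,||,||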
|15961081|,|0|,|DETAILS|,||,||,|<p>Latitude:54.593406, Longitude:-5.934344 <b>3-5 Market Street Lurgan BT66</b> </p> <p> </p> <p> </p> <p> Price : £250,000 </p> <p> Size: 0.173 acres (0.07ha) </p> <p> </p> <p> Text </p> <p> </p> <p> Text </p> <p> </p> <p> Text </p> <p> </p> <p> </p>|,||,||
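The real lines are much longer, but for now I have replaced the paragraphs with just the word "Text".
And no, I can't rewrite the text file. Any pointers would be greatly appreciated.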
if (s.contains("Price"))
{
    int pstart = 0;
    int pend = 0;
    if (s.contains("<p>Size"))
    {
        //if has pound symbol
        if (s.contains("£"))
        {
            String[] str = s.split("£");
            StringBuilder bs = new StringBuilder();
            for (String st : str)
            {
                bs.append(st);
            }
            pstart = bs.indexOf("Price") + 8;
            pend = bs.indexOf("</p>") - 1;
        }
        else
        {
            pstart = s.indexOf("Price") + 8;
            pend = s.indexOf("<p>Size");
        }
        String sp = s.substring(pstart, pend);
        String[] spl = sp.split(",");
        StringBuilder build = new StringBuilder();
        for (String st : spl)
        {
            build.append(st);
            f = build.toString();
        }
        in = Integer.parseInt(f);
        p.setPrice(in);
    }
    else
    {
        if (s.contains("£"))
        {
            String[] str = s.split("£");
            StringBuilder bs = new StringBuilder();
            for (String st : str)
            {
                bs.append(st);
            }
            pstart = bs.indexOf("Price : ");
            pend = bs.indexOf("</p>") - 1;
        }
        else
        {
            pstart = s.indexOf("Price") + 8;
            pend = s.indexOf("<p>Size");
        }
        String sp = s.substring(pstart, pend);
        String[] spl = sp.split(",");
        StringBuilder build = new StringBuilder();
        for (String st : spl)
        {
            build.append(st);
            f = build.toString();
        }
        in = Integer.parseInt(f);
        p.setPrice(in);
    }
}
// if has size property
if (s.contains("Size"))
{
    //if in acres
    if (s.contains("acres"))
    {
        int sstart = s.indexOf("Size:") + 6;
        int send = s.indexOf("acres") - 1;
        String sp = s.substring(sstart, send);
        double d = Double.parseDouble(sp);
        p.setSized(d);
    }
    if (s.contains("()"))
    {
        int sstart = s.indexOf("Size:") + 6;
        int send = s.indexOf("Sq") - 2;
        String sp = s.substring(sstart, send);
        if (sp.contains("-") && sp.contains(","))
        {
            String[] spl = sp.split("-|,");
            StringBuilder str = new StringBuilder();
            str.append(spl[0] + spl[1]);
            StringBuilder str2 = new StringBuilder(0);
            str2.append(spl[2] + spl[3]);
            String s1 = str.toString();
            int i = Integer.parseInt(s1);
            p.setSize(i);
            String s2 = str2.toString();
            i = Integer.parseInt(s2);
            p.setSize2(i);
        }
        if (sp.contains("-"))
        {
            String[] spl = sp.split("-");
            int one = Integer.parseInt(spl[0]);
            p.setSize(one);
            int two = Integer.parseInt(spl[1]);
            p.setSize2(two);
        }
        else if (!(sp.contains("-")))
        {
            if (sp.contains(","))
            {
                String[] spl = sp.split(",");
                StringBuilder build = new StringBuilder();
                for (String st : spl)
                {
                    build.append(st);
                    f = build.toString();
                }
                in = Integer.parseInt(f);
                p.setSize(in);
            }
            else
            {
                p.setSize(Integer.parseInt(sp));
            }
        }
    }
}
v.add(p);
p = new Property();
Answer 0 (score: 1)
I would use regular expressions; the following should point you in the right direction:
Pattern pricePattern = Pattern.compile("Price\\s*:\\s*(£)?([0-9,.]+)");
Pattern sqFeetPattern = Pattern.compile("Size\\s*:\\s*([0-9,.]+)\\s*Sq");
Pattern acresPattern = Pattern.compile("Size\\s*:\\s*([0-9,.]+)\\s*acres\\s*\\(([0-9,.]+)ha\\)");
NumberFormat nf = NumberFormat.getNumberInstance();
nf.setGroupingUsed(true);
BufferedReader r = new BufferedReader(inputFileReader);
String line;
while ((line = r.readLine()) != null) {
    Matcher m = pricePattern.matcher(line);
    if (m.find()) {
        int price = nf.parse(m.group(2)).intValue();
        System.out.println("Price: " + price);
    }
    m = sqFeetPattern.matcher(line);
    if (m.find()) {
        int sqFeet = nf.parse(m.group(1)).intValue();
        System.out.println("Sq Feet: " + sqFeet);
    }
    m = acresPattern.matcher(line);
    if (m.find()) {
        float acres = nf.parse(m.group(1)).floatValue();
        float ha = nf.parse(m.group(2)).floatValue();
        System.out.println("Acres: " + acres + " ha: " + ha);
    }
}
N.B. inputFileReader would be defined as a FileReader, or something similar, that reads your file.
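As a rough sketch only, the regular-expression idea could be wrapped in small helpers that return an empty result when a field is missing, so a line without a price or size does not throw. The class name ListingFields and the method names price, squareFeet and acres are hypothetical, and pinning the number format to Locale.UK is an assumption that the file always uses a comma for grouping and a dot for decimals.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;
import java.util.OptionalDouble;
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; not part of the original question or answer.
public class ListingFields {
    // \u00A3 is the pound sign, written as an escape so the source file encoding does not matter.
    private static final Pattern PRICE =
            Pattern.compile("Price\\s*:\\s*\u00A3?([0-9][0-9,]*)");
    private static final Pattern SQ_FEET =
            Pattern.compile("Size\\s*:\\s*([0-9][0-9,]*)\\s*Sq");
    private static final Pattern ACRES =
            Pattern.compile("Size\\s*:\\s*([0-9]+(?:\\.[0-9]+)?)\\s*acres");

    // Assumption: UK conventions (',' groups thousands, '.' is the decimal point).
    private static Number parseNumber(String text) {
        try {
            return NumberFormat.getNumberInstance(Locale.UK).parse(text);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a number: " + text, e);
        }
    }

    static OptionalInt price(String line) {
        Matcher m = PRICE.matcher(line);
        return m.find() ? OptionalInt.of(parseNumber(m.group(1)).intValue())
                        : OptionalInt.empty();
    }

    static OptionalInt squareFeet(String line) {
        Matcher m = SQ_FEET.matcher(line);
        return m.find() ? OptionalInt.of(parseNumber(m.group(1)).intValue())
                        : OptionalInt.empty();
    }

    static OptionalDouble acres(String line) {
        Matcher m = ACRES.matcher(line);
        return m.find() ? OptionalDouble.of(parseNumber(m.group(1)).doubleValue())
                        : OptionalDouble.empty();
    }

    public static void main(String[] args) {
        // Shortened versions of the two sample lines from the question.
        String first = "<p>Price : 150,000<p>Size: 2,411 Sq Feet ()<p>Rent : 50,500 Per Annum";
        String second = "<p> Price : \u00A3250,000 </p> <p> Size: 0.173 acres (0.07ha) </p>";
        for (String line : new String[] {first, second}) {
            System.out.println(price(line) + " " + squareFeet(line) + " " + acres(line));
        }
    }
}
```

Because the patterns search for the label and then capture only the digits, they do not depend on where the nearest <p> or </p> happens to sit, which is where the index arithmetic in the question goes negative.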
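Answer 1 (score: 0)
The approach I would take is to convert the HTML character entities (such as the one for £) to their equivalent text characters and to filter out the HTML tags (<p> etc.).
For step 2, I was thinking of something like this: strip all the HTML tags out of the string before splitting the string on the field delimiter (|).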
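A minimal sketch of that idea, under a few assumptions: the class name ListingLine and its methods are hypothetical, &pound; is the only entity handled, every field is wrapped in | with a comma between fields (as in the sample lines), and a simple regular expression is good enough to drop the tags, including malformed ones such as <b > and <//b>.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class illustrating "strip the markup, then split".
public class ListingLine {

    // Replace the pound entity, drop anything that looks like a tag,
    // and collapse the whitespace that the tags leave behind.
    static String stripMarkup(String text) {
        return text
                .replace("&pound;", "\u00A3")
                .replaceAll("<[^>]*>", " ")
                .replaceAll("\\s+", " ")
                .trim();
    }

    // Strip the markup first, then split on the "|,|" sequence that
    // separates one |field| from the next.
    static List<String> splitFields(String line) {
        String body = stripMarkup(line);
        if (body.startsWith("|")) {
            body = body.substring(1);
        }
        if (body.endsWith("|")) {
            body = body.substring(0, body.length() - 1);
        }
        List<String> fields = new ArrayList<>();
        // Limit -1 keeps the trailing empty fields instead of discarding them.
        for (String raw : body.split("\\|,\\|", -1)) {
            fields.add(raw.trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "|15961081|,|0|,|DETAILS|,||,||,|<p>Latitude:54.593406, "
                + "Longitude:-5.934344 <b>3-5 Market Street Lurgan BT66</b> </p> "
                + "<p> Price : &pound;250,000 </p>|,||,||";
        for (String field : splitFields(line)) {
            System.out.println("[" + field + "]");
        }
    }
}
```

This would break if the free text itself ever contained the sequence |,| or a literal < character, so it is worth checking the real file for those before relying on it.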