Question

My data is in a reliable format:

    1. New York Times - USA
    2. Guardian - UK
    3. Le Monde - France

I am using this code to parse the newspaper and country values:

    String newspaper = "";
    String country = "";
    int hyphenIndex = unparsedText.indexOf("-");
    if (hyphenIndex > -1)
    {
        newspaper = unparsedText.substring(0, hyphenIndex);
    }
    country = unparsedText.substring(hyphenIndex + 1, unparsedText.length());
    country = country.trim();

But this produces the following newspaper values:

    1. New York Times
    2. Guardian
    3. Le Monde

What is the simplest change that would give me these newspaper values instead:

    New York Times
    Guardian
    Le Monde

Answer 1

Here is a regex-based solution:

input.replaceAll("(?m)^\\d+\\.\\s*|\\s*-\\s*.*?$", "");

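The regex works in multiline mode (?m) and removes:

the leading digits, followed by a dot, followed by any amount of whitespace.
the hyphen followed by anything up to the end of the line.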

I am assuming that the newspaper names contain no hyphens.
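As a rough illustration of the two removals described above, here is a minimal sketch that applies the same pattern to the three sample lines joined by newlines. The class name LeadingNumberStripper and the method stripToNames are made up for this example; the pattern is the one from this answer.

```java
class LeadingNumberStripper {
    // Removes the "N. " prefix and the " - Country" suffix from every line.
    // Assumes newspaper names contain no hyphen, as stated in the answer.
    static String stripToNames(String input) {
        // Strings are immutable, so the result of replaceAll must be used.
        return input.replaceAll("(?m)^\\d+\\.\\s*|\\s*-\\s*.*?$", "");
    }

    public static void main(String[] args) {
        String input = "1. New York Times - USA\n2. Guardian - UK\n3. Le Monde - France";
        System.out.println(stripToNames(input));
    }
}
```

Note that replaceAll returns a new string, so the result has to be assigned or returned rather than discarded.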


Answer 2

Surely you just find the index of the first "." and use substring(from, to) to take the text between it and the hyphen.

Something like this:

String newspaper = "";
String country = "";
int hyphenIndex = unparsedText.indexOf("-");
int dotIndex = unparsedText.indexOf(".");
if (hyphenIndex > -1)
{
    // trim() drops the space after the dot and the space before the hyphen
    newspaper = unparsedText.substring(dotIndex + 1, hyphenIndex).trim();
}
country = unparsedText.substring(hyphenIndex + 1, unparsedText.length());
country = country.trim();

Answer 3

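If the format really is reliable, the simplest (and probably the most efficient) approach seems to be to find the first instance of the . character and take the substring starting at dotIndex + 1. In fact, you can combine this with your current substring operation (the one based on the position of the dash) and extract the newspaper name in one go.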
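One possible way to write that combined extraction is sketched below. The class name SubstringParser, the method parse, and the choice to throw on a malformed line are assumptions made for this example, not something the question specifies.

```java
class SubstringParser {
    // Returns {newspaper, country} for a line shaped like "1. New York Times - USA".
    static String[] parse(String unparsedText) {
        int dotIndex = unparsedText.indexOf('.');
        // Search for the hyphen only after the dot.
        int hyphenIndex = unparsedText.indexOf('-', dotIndex + 1);
        if (dotIndex < 0 || hyphenIndex < 0) {
            // Assumption: a line without both separators is treated as an error.
            throw new IllegalArgumentException("Unexpected format: " + unparsedText);
        }
        String newspaper = unparsedText.substring(dotIndex + 1, hyphenIndex).trim();
        String country = unparsedText.substring(hyphenIndex + 1).trim();
        return new String[] { newspaper, country };
    }

    public static void main(String[] args) {
        String[] parts = parse("1. New York Times - USA");
        System.out.println("Newspaper: " + parts[0] + " Country: " + parts[1]);
    }
}
```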

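If the format is slightly less reliable, you could use a regular expression that matches digits followed by a separator followed by whitespace, and remove that. In this case, though, it seems like overkill.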
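For the less reliable case, a sketch along these lines might do. It assumes, purely for illustration, that the separator after the number may be either "." or ")" and that stray whitespace may surround it; the class name PrefixStripper and the method stripPrefix are invented here.

```java
class PrefixStripper {
    // Removes a leading "N." or "N)" prefix, tolerating optional whitespace.
    // Lines without such a prefix are returned unchanged.
    static String stripPrefix(String line) {
        return line.replaceFirst("^\\s*\\d+\\s*[.)]\\s*", "");
    }

    public static void main(String[] args) {
        System.out.println(stripPrefix("1. New York Times - USA"));
        System.out.println(stripPrefix(" 12) Guardian - UK"));
    }
}
```

The remaining "Name - Country" text can then be split at the hyphen as in the original code.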

Answer 4

If all entries follow the format you gave, you can look for the full stop after the number, e.g.

int dotIndex = unparsedText.indexOf(".");

and then

newspaper = unparsedText.substring(dotIndex + 2, hyphenIndex - 1);

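Note: you want to start 2 characters after the . and stop 1 character before the - so that the surrounding spaces are excluded, or else use trim().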

Answer 5

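// Pattern has no public constructor, so use Pattern.compile; "+" avoids an empty match at index 0
java.util.regex.Matcher m = java.util.regex.Pattern.compile("[a-zA-Z ]+").matcher(unparsedText);
if (m.find()) {
    System.err.println(unparsedText.substring(m.start(), m.end()).trim());
}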

Note #1: this assumes that a newspaper name cannot contain digits.

Note #2: not tested.

Answer 6

If you split on . and -, then String#split(String regex) will work:

[0] => "1"
[1] => " New York Times "
[2] => " USA"

Then just trim the element you want.
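A minimal sketch of the split approach follows. The character class "[.-]" and the limit of 3 are choices made for this example (the limit keeps any later dot or hyphen inside the country part), and the names SplitParser and parse are hypothetical.

```java
class SplitParser {
    // Returns {newspaper, country} for a line shaped like "1. New York Times - USA".
    static String[] parse(String unparsedText) {
        // "[.-]" matches either a dot or a hyphen; at most 3 pieces are produced.
        String[] parts = unparsedText.split("[.-]", 3);
        if (parts.length < 3) {
            // Assumption: a line without both separators is treated as an error.
            throw new IllegalArgumentException("Unexpected format: " + unparsedText);
        }
        return new String[] { parts[1].trim(), parts[2].trim() };
    }

    public static void main(String[] args) {
        String[] parts = parse("3. Le Monde - France");
        System.out.println("Newspaper: " + parts[0] + " Country: " + parts[1]);
    }
}
```

Like the other approaches, this breaks if a newspaper name itself contains a dot or a hyphen.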

Answer 7

This regex should work:

    Pattern pattern = Pattern.compile("\\d+\\.\\s(.*)\\s-.*");
    Matcher matcher = pattern.matcher("1. New York Times - USA");
    // A match must be attempted before any group can be read
    Assert.assertTrue(matcher.matches());
    String newspaper = matcher.group(1);
    Assert.assertEquals("New York Times", newspaper);

Answer 8

I would do it like this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Application
{
    public static void main ( final String[] args )
    {
        final String[] lines = new String[] { "1. New York Times - USA", "2. Guardian - UK", "3. Le Monde - France" };

        final Pattern p = Pattern.compile ( "\\.\\s+(.*?)\\s+-\\s+(.*)" );

        for ( final String unparsedText : lines )
        {
            String newspaper;
            String country;

            final Matcher m = p.matcher ( unparsedText );

            if ( m.find () )
            {
                newspaper = m.group ( 1 );
                country = m.group ( 2 );

                System.out.println ( "Newspaper: " + newspaper + " Country: " + country );
            }
        }
    }
}

What is the simplest way to remove the extra leading digits?
