Question

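I'm having serious trouble extracting a term from each line of strings. More specifically, I have a file with a .csv extension that is not really in CSV format (every line is stored whole in line[0]).

Here are a few sample lines out of thousands: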

test.csv

Line 1: "31451    CID005319044   15939353   C8H14O3S2    beta-lipoic acid   C1CS@S[C@@H]1CCCCC(=O)O"

Line 2: "12232    COD05374044   23439353   C924O3S2    saponin   CCCC(=O)O"

Line 3: "9048    CTD042032   23241   C3HO4O3S2    Berberine   [C@@H]1CCCCC(=O)O"

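I want to extract only "beta-lipoic acid", "saponin", and "Berberine", which sit in the 5th position. As you can see, there are large gaps between the terms, which is why I call it the 5th position.

In this case, how can I extract the term in the 5th position from each line?

One more thing:

The whitespace between the six terms is not always the same length. It can be one, two, three, four, five characters, and so on.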

Answer 1

If your line[] is of type String:

String s = line[0];
String[] split = s.split("   ");
return split[4]; //which is the fifth item

For the delimiter, you can use a regular expression if you want to be more precise.
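
As a rough illustration of the regex idea, here is a minimal sketch. It assumes columns are always separated by at least two whitespace characters and that a term never contains more than one consecutive space; the class name FifthField and the method extract are made up for this example.

```java
import java.util.Optional;

class FifthField {
    // Split on runs of two or more whitespace characters, so a single space
    // inside a term (e.g. "beta-lipoic acid") does not break the term apart.
    // Returns an empty Optional when the line has fewer than five fields.
    static Optional<String> extract(String line) {
        String[] fields = line.trim().split("\\s{2,}");
        return fields.length >= 5 ? Optional.of(fields[4]) : Optional.empty();
    }

    public static void main(String[] args) {
        String sample = "31451    CID005319044   15939353   C8H14O3S2    "
                + "beta-lipoic acid   C1CS@S[C@@H]1CCCCC(=O)O";
        System.out.println(extract(sample).orElse("<no fifth field>"));
    }
}
```

If some lines really do use a single space between columns, this split cannot tell a column gap from a space inside a name, so a different rule would be needed for those lines.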

Answer 2

Another attempt:

import java.io.File;
import java.util.Scanner;

public class HelloWorld {
    // The number of tokens per row, where tokens are separated by an arbitrary number
    //  of spaces or tabs (7 here, because a two-word name counts as two tokens)
    final static int COLS = 7;

    public static void main(String[] args) {
        System.out.println("Tokens:");
        try (Scanner scanner = new Scanner(new File("input.txt")).useDelimiter("\\s+")) {
            // Counter for the current token position
            int n = 0;
            String tmp = "";
            StringBuilder item = new StringBuilder();
            // Process the input as a stream of tokens
            while (scanner.hasNext()) {
                tmp = scanner.next();
                n += 1;
                // If we have reached the fifth token, take its content and append the
                // sixth token too, as the name we want consists of two space-separated
                // words. Feel free to customize this if your name layout varies.
                if (n % COLS == 5) {
                    item.setLength(0);
                    item.append(tmp);
                    item.append(" ");
                    item.append(scanner.next());
                    n += 1;

                    System.out.println(item.toString()); // Do something with the
                                                         // expression we extracted
                }
            }
        }
        catch(java.io.IOException e){
            System.out.println(e.getMessage());
        }
    }
}
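
If the name is not always exactly two words, one possible variation is to work line by line instead of on one token stream. The sketch below assumes every line has four single-token columns before the name and exactly one single-token column after it, so everything in between is treated as the name. The class name MiddleTerm and the method extract are hypothetical.

```java
import java.util.Arrays;

class MiddleTerm {
    // Assumption: tokens 0..3 are the leading columns, the last token is the
    // trailing column, and whatever lies between them is the name.
    static String extract(String line) {
        String[] tokens = line.trim().split("\\s+");
        if (tokens.length < 6) {
            throw new IllegalArgumentException("Expected at least 6 tokens: " + line);
        }
        // Re-join the name tokens with single spaces.
        return String.join(" ", Arrays.copyOfRange(tokens, 4, tokens.length - 1));
    }

    public static void main(String[] args) {
        System.out.println(extract(
                "31451 CID005319044 15939353 C8H14O3S2 beta-lipoic acid C1CS@S[C@@H]1CCCCC(=O)O"));
        System.out.println(extract(
                "12232  COD05374044   23439353 C924O3S2    saponin  CCCC(=O)O"));
    }
}
```

Because it counts tokens from both ends, this version does not depend on how many spaces separate the columns, but it breaks if any column other than the name can contain a space.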

Answer 3

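How are the columns separated? For example, if the columns are separated by tabs, I believe you can use the split method. Try something like this: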

String[] parts = str.split("\\t");

Your expected result will be in parts[4].
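
To apply this to many lines at once, something along these lines might work. It is only a sketch under the assumption that the file really is tab-separated; the class name TabColumns and the method fifthColumn are invented for illustration, and lines with fewer than five columns are simply skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TabColumns {
    // Collect the fifth tab-separated column of every line,
    // skipping lines that have fewer than five columns.
    static List<String> fifthColumn(List<String> lines) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.split("\\t");
            if (parts.length >= 5) {
                result.add(parts[4].trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "31451\tCID005319044\t15939353\tC8H14O3S2\tbeta-lipoic acid\tC1CS@S[C@@H]1CCCCC(=O)O",
                "too\tshort");
        System.out.println(fifthColumn(lines));
    }
}
```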

Answer 4

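Just use String.split() with a regular expression that matches at least 2 whitespace characters:

String foo = "31451    CID005319044   15939353   C8H14O3S2    beta-lipoic acid   C1CS@S[C@@H]1CCCCC(=O)O";
String[] bar = foo.split("\\s{2,}"); // two or more whitespace characters
String term = bar[4]; // beta-lipoic acid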

How do I extract a specific term from each line of strings?
