Grouping text

Time: 2016-06-14 08:18:37

Tags: grouping levenshtein-distance text-analysis

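I need help grouping text. I have a list of vendors like the one below. As you can see, the first few belong to CENTURYLINK, the next ones to SMART ATT, and so on. Is there a way to group or tag these texts with a single label, or to classify them according to the pool they belong to?

Thanks in advance.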

001 CENTURYLINK IREP

003 CENTURYLINK MY ACCOUNT

003-ClearTalk Wireless

004 CENTURYLINK IVR

005 CENTURYLINK RECURRING

006 CENTURYLINK WIFI

007 CENTURYLINK CABLE

111 SMART ATT

112 SMART ATT

113 - SMART - ATT

114 SMART ATT

120 - SMART - ATT

131 - SMART - ATT

137 - SMART - ATT

A WIRELESS AMERY

A WIRELESS ANNA

A WIRELESS APTOS

A WIRELESS ARCADIA

A WIRELESS ARNOLDS PAR

A WIRELESS ASHLAND

A WIRELESS ATHENS

1 answer:

Answer 0 (score: 0)

You have a few options. Among the simplest would be to match vendor substrings, as follows:

import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class GroupVendors {
    public static void main(final String[] args) {
        final List<String> vendors = Arrays.asList(
            "CENTURYLINK",
            "SMART",
            "ATT",
            "A WIRELESS");

        final List<String> uncategorizedVendors = Arrays.asList(
            "001 CENTURYLINK IREP",
            "003 CENTURYLINK MY ACCOUNT",
            "003-ClearTalk Wireless",
            "004 CENTURYLINK IVR",
            "005 CENTURYLINK RECURRING",
            "006 CENTURYLINK WIFI",
            "007 CENTURYLINK CABLE",
            "111 SMART ATT",
            "112 SMART ATT",
            "113 - SMART - ATT",
            "114 SMART ATT",
            "120 - SMART - ATT",
            "131 - SMART - ATT",
            "137 - SMART - ATT",
            "A WIRELESS AMERY",
            "A WIRELESS ANNA",
            "A WIRELESS APTOS",
            "A WIRELESS ARCADIA",
            "A WIRELESS ARNOLDS PAR",
            "A WIRELESS ASHLAND",
            "A WIRELESS ATHENS");

        // A TreeMap keeps the categories sorted by name when they are printed.
        final Map<String, List<String>> categorizedVendors = new TreeMap<>();

        // Start every category with an empty bin.
        for (final String vendor : vendors) {
            categorizedVendors.put(vendor, new LinkedList<>());
        }

        // Add each entry to every category whose name it contains (case-sensitive).
        for (final String vendor : uncategorizedVendors) {
            for (final Map.Entry<String, List<String>> entry : categorizedVendors.entrySet()) {
                final String category = entry.getKey();
                if (vendor.contains(category)) {
                    final List<String> bin = entry.getValue();
                    bin.add(vendor);
                }
            }
        }

        for (final Map.Entry<String, List<String>> entry : categorizedVendors.entrySet()) {
            final String category = entry.getKey();
            final List<String> bin = entry.getValue();
            System.out.printf("vendors(\"%s\") = {%n", category);
            if (!bin.isEmpty()) {
                System.out.printf("    %s%n",
                    bin.stream()
                        .map((vendor) -> String.format("\"%s\"", vendor))
                        .collect(Collectors.joining(",\n    ")));
            }
            System.out.println("}");
        }
    }
}

Sample run:

% java GroupVendors
vendors("A WIRELESS") = {
    "A WIRELESS AMERY",
    "A WIRELESS ANNA",
    "A WIRELESS APTOS",
    "A WIRELESS ARCADIA",
    "A WIRELESS ARNOLDS PAR",
    "A WIRELESS ASHLAND",
    "A WIRELESS ATHENS"
}
vendors("ATT") = {
    "111 SMART ATT",
    "112 SMART ATT",
    "113 - SMART - ATT",
    "114 SMART ATT",
    "120 - SMART - ATT",
    "131 - SMART - ATT",
    "137 - SMART - ATT"
}
vendors("CENTURYLINK") = {
    "001 CENTURYLINK IREP",
    "003 CENTURYLINK MY ACCOUNT",
    "004 CENTURYLINK IVR",
    "005 CENTURYLINK RECURRING",
    "006 CENTURYLINK WIFI",
    "007 CENTURYLINK CABLE"
}
vendors("SMART") = {
    "111 SMART ATT",
    "112 SMART ATT",
    "113 - SMART - ATT",
    "114 SMART ATT",
    "120 - SMART - ATT",
    "131 - SMART - ATT",
    "137 - SMART - ATT"
}

I've assumed that the vendor categories you are interested in are "CENTURYLINK", "SMART", "ATT", and "A WIRELESS". As a result, every entry containing both "SMART" and "ATT" lands in both bins, and "003-ClearTalk Wireless" matches none of the categories, so it does not appear in any bin. If you want each vendor to be categorized in exactly one bin, you will need to decide which category takes precedence when more than one matches.
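
One way to put each vendor in exactly one bin is to treat the category list as a priority order and stop at the first match. The following is only a minimal sketch under that assumption. The class name GroupVendorsExclusive, the method group, and the "UNCATEGORIZED" label are hypothetical names introduced for illustration, and the priority order shown is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupVendorsExclusive {
    // Hypothetical label for entries that match no category.
    public static final String UNCATEGORIZED = "UNCATEGORIZED";

    // Assigns each entry to the first category, in list order, whose name it contains.
    // Entries that match no category go to the UNCATEGORIZED bin.
    public static Map<String, List<String>> group(final List<String> categoriesByPriority,
                                                  final List<String> entries) {
        // A LinkedHashMap preserves the priority order of the categories.
        final Map<String, List<String>> bins = new LinkedHashMap<>();
        for (final String category : categoriesByPriority) {
            bins.put(category, new ArrayList<>());
        }
        bins.put(UNCATEGORIZED, new ArrayList<>());

        for (final String entry : entries) {
            String chosen = UNCATEGORIZED;
            for (final String category : categoriesByPriority) {
                if (entry.contains(category)) { // case-sensitive, as in the code above
                    chosen = category;
                    break; // first match wins
                }
            }
            bins.get(chosen).add(entry);
        }
        return bins;
    }

    public static void main(final String[] args) {
        // Assumed priority: "SMART" is listed before "ATT", so it wins when both match.
        final List<String> categoriesByPriority =
            Arrays.asList("CENTURYLINK", "SMART", "ATT", "A WIRELESS");
        final List<String> entries = Arrays.asList(
            "001 CENTURYLINK IREP",
            "003-ClearTalk Wireless",
            "113 - SMART - ATT",
            "A WIRELESS AMERY");
        group(categoriesByPriority, entries).forEach(
            (category, bin) -> System.out.println(category + " = " + bin));
    }
}
```

With this ordering, "113 - SMART - ATT" goes only to the "SMART" bin and the "ATT" bin stays empty, while "003-ClearTalk Wireless" ends up under "UNCATEGORIZED". Reordering the category list changes which bin wins.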