如何检查java中解析数据的语言

时间:2018-03-23 10:52:23

标签: java json io jsoup

我正在解析google的一项服务。这是多个langugae的结果数据。虽然我只想要英文数据。我怎样才能确保langugae。请建议。

String url = "https://newsapi.org/v2/top-headlines?sources=google-news&apiKey=89c8009165774e0fad3742f78b50c6da";

            URL url1 = new URL(url);
            URLConnection uc = url1.openConnection();

            InputStreamReader input = new InputStreamReader(uc.getInputStream());
            BufferedReader in = new BufferedReader(input);

            String inputLine;
            String fullline = "";

            while ((inputLine = in.readLine()) != null) {
                fullline = fullline.concat(inputLine);
            }

            JSONObject rootObject = new JSONObject(fullline);

            JSONArray rows1 = (JSONArray) rootObject.get("articles");

示例数据是:

    {
  "status": "ok",
  "totalResults": 1100,
  "articles": [
    {
      "source": {
        "id": null,
        "name": "Ua-football.com"
      },
      "author": "Спорт.ua",
      "title": "Шаран може покинути Олександрію після закінчення цього сезону",
      "description": "Контракт наставника закінчується влітку",
      "url": "https://www.ua-football.com/ua/ukrainian/high/1521800049-sharan-mozhe-pokinuti-oleksandriyu-pislya-zakinchennya-cogo-sezonu.html",
      "urlToImage": "https://static.ua-football.com/img/upload/18/24f537.jpeg",
      "publishedAt": "2018-03-23T10:22:21Z"
    },
    {
      "source": {
        "id": null,
        "name": "Nikkansports.com"
      },
      "author": null,
      "title": "西武菊池雄星、開幕へ万全 OP戦ラス投5回無失点",
      "description": null,
      "url": "https://www.nikkansports.com/baseball/news/201803230000734.html",
      "urlToImage": null,
      "publishedAt": "2018-03-23T10:20:46Z"
    },
    {
      "source": {
        "id": null,
        "name": "Siol.net"
      },
      "author": null,
      "title": "Picomat na Koroškem je postal prava atrakcija #video",
      "description": "Slovenj Gradec se je pred kratkim obogatil s pridobitvijo, s katero se lahko pohvalita tudi Dubaj in Dunaj. Na Koroškem je za pravo revolucijo poskrbel picomat, ki je postal pravi magnet za odrasle in mladino. Uporabniki morajo samo pritisniti na gumb in svež…",
      "url": "https://siol.net/trendi/kulinarika/picomat-na-koroskem-je-postal-prava-atrakcija-video-463111",
      "urlToImage": "https://siol.net/media/img/9a/c1/8b62129ba4efcbf0faf9-picomat.jpeg",
      "publishedAt": "2018-03-23T10:19:56Z"
    },
    {
      "source": {
        "id": null,
        "name": "Nikkansports.com"
      },
      "author": null,
      "title": "明秀日立・金沢監督「勝ちに不思議な勝ちあり」"
    }
  ]
}

3 个答案:

答案 0 :(得分:1)

您正在寻找一种识别文本语言的方法,这是一个难以解决的问题。

您很可能需要集成库或依赖第三方API。

有一些有用的链接here。您也可以使用IBM Watson API。

答案 1 :(得分:1)

使用单词频率。拿最常用的词,最好知道这些词在普通文本中的百分比,然后检查。

public boolean isEnglish(String text) {
    Set<String> mostFrequentWords = new HashSet<>();
    Collections.addAll(mostFrequentWords,
        "the", "of", "and", "a", "to", "in", "is", "be", "that", "was", "he", "for",
        "it", "with", "as", "his", "i", "on", "have", "at", "by", "not", "they",
        "this", "had", "are", "but", "from", "or", "she", "an", "which", "you", "one",
        "we", "all", "were", "her", "would", "there", "their", "will", "when", "who",
        "him", "been", "has", "more", "if", "no", "out", "do", "so", "can", "what",
        "up", "said", "about", "other", "into", "than", "its", "time", "only", "could",
        "new", "them", "man", "some", "these", "then", "two", "first", "may", "any",
        "like", "now", "my");

    int wordCount = 0;
    int hits = 0;

    Pattern wordPattern = Pattern.compile("\\b\\p{L}+\\b");
    Matcher m = wordPattern.matcher(text);
    while (m.find() && wordCount < 100) {
        String word = m.group().toLowerCase(Locale.ENGLISH);
        ++wordCount;
        if (mostFrequentWords.contains(word)) {
           ++hits;
        }
    }
    return hits * 100 / wordCount >= 30; // At least 30 percent
}

非拉丁语也可被检测为:

String ascii = text.replaceAll("\\P{ASCII}", "");
if ((text.length() - ascii.length()) * 100 / text.length() > 10) {
    return false; // More than 10% non-ASCII
}

请注意,某些内容(如引号,项目符号,短划线等逗号)不是ASCII。或者像mañanafaçade这样的借词。

答案 2 :(得分:-1)

找到解决方案。按照期望工作。

private static boolean isEnglish(String text) {
        CharsetEncoder asciiEncoder = Charset.forName("US-ASCII").newEncoder();
        CharsetEncoder isoEncoder = Charset.forName("ISO-8859-1").newEncoder();
        return  asciiEncoder.canEncode(text) || isoEncoder.canEncode(text);
    }