Character encoding problem when extracting data from a web page with Google Apps Script

Asked: 2015-11-18 23:59:07

Tags: google-apps-script character-encoding extract hebrew windows-1255

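I wrote a script in Google Apps Script that extracts text from a web page into Google Sheets. It only has to handle one specific site, so it doesn't need to be versatile. The script works almost exactly as I want, except for a character encoding problem. I am extracting both Hebrew and English text. The meta tag in the HTML declares charset=Windows-1255. The English comes through perfectly, but the Hebrew shows up as black diamonds containing question marks.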

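I found this question, which suggests passing the data into a blob and then using the getDataAsString method to convert it to another encoding. I tried several encodings and got different results: UTF-8 shows black diamonds with question marks, UTF-16 shows Korean characters, ISO 8859-8 returns an error saying it is not a valid argument, and the original Windows-1255 shows one Hebrew character along with a bunch of other garbage.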

However, I can copy and paste the Hebrew text into Google Sheets by hand and it displays correctly.

I even tested passing Hebrew directly from the Google Apps Script code:

function passHebrew() {
  return "וַיְדַבֵּר";
}

This displays the Hebrew correctly in Google Sheets.

Hebrew displayed as each of the encodings I mentioned

My code is as follows:

function parseText(book, chapter) {
  //var bk = book;
  //var ch = chapter;
  var bk = '04'; //hard-coded for testing purposes
  var ch = '01'; //hard-coded for testing purposes
  var url = 'http://www.mechon-mamre.org/p/pt/pt' + bk + ch + '.htm';

  var xml = UrlFetchApp.fetch(url).getContentText();

  //I had to "fix" these xml errors for XmlService.parse(xml) below
  //to function.
  xml = xml.replace('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">', '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "">');
  xml = xml.replace('<LINK REL="stylesheet" HREF="p.css" TYPE="text/css">', '<LINK REL="stylesheet" HREF="p.css" TYPE="text/css"></LINK>');
  xml = xml.replace('<meta http-equiv="Content-Type" content="text/html; charset=Windows-1255">', '<meta http-equiv="Content-Type" content="text/html; charset=Windows-1255"></meta>');
  xml = xml.replace(/ALIGN=CENTER/gi, 'ALIGN="CENTER"');
  xml = xml.replace(/<BR>/gi, '<BR></BR>');
  xml = xml.replace(/class=h/gi, 'class="h"');

  //This section is the specific route to the table in the page I want
  var document = XmlService.parse(xml);
  var body = document.getRootElement().getChildren("BODY");
  var maintable = body[0].getChildren("TABLE");
  var maintablechildren = maintable[0].getChildren();

  //This creates a two-dimensional array so that I can store the Hebrew
  //in the first column and the English in the second column
  var array = new Array(maintablechildren.length);
  for (var i = 0; i < maintablechildren.length; i++) {
    array[i] = new Array(2);
  }

  //This is where the table gets parsed into the array
  for (var i = 0; i < maintablechildren.length; i++) {
    var verse = maintablechildren[i].getChildren();

    //This is where the encoding problem occurs.
    //I originally tried verse[0].getText() but it didn't work.
    array[i][0] = Utilities.newBlob(verse[0].getText()).getDataAsString('UTF-8');
    //This array receives the English text and works fine.
    array[i][1] = verse[1].getText();
  }

  return array;
}

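What am I overlooking, misunderstanding, or doing wrong? I don't have a good understanding of how encodings work, so I don't see why converting to UTF-8 doesn't work.

1 answer:

Answer 0 (score: 3)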

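Your problem occurs before the line you marked as the encoding problem: UrlFetchApp decodes the response with its default encoding, so the Hebrew text is already mangled by the time you receive it as a string.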
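To see why re-encoding afterwards cannot help, here is a minimal sketch in plain JavaScript. It is meant for Node.js or a browser rather than Apps Script, because it relies on TextDecoder and TextEncoder. It assumes the default decoding is UTF-8, which is consistent with the replacement characters described in the question. The function decodeHebrewLetters is a hypothetical, deliberately incomplete decoder that covers only ASCII and the unpointed Hebrew letters.

```javascript
// Bytes for the unpointed Hebrew word וידבר as a Windows-1255 page would send it.
var windows1255Bytes = new Uint8Array([0xE5, 0xE9, 0xE3, 0xE1, 0xF8]);

// Wrong: interpret the bytes as UTF-8. Each invalid byte becomes U+FFFD,
// the "black diamond with a question mark".
var misdecoded = new TextDecoder('utf-8').decode(windows1255Bytes);

// Hypothetical minimal decoder (assumption: only ASCII and the Hebrew letters
// matter here). In Windows-1255, bytes 0xE0-0xFA map to U+05D0-U+05EA.
function decodeHebrewLetters(bytes) {
  var out = '';
  for (var i = 0; i < bytes.length; i++) {
    var b = bytes[i];
    if (b < 0x80) {
      out += String.fromCharCode(b);
    } else if (b >= 0xE0 && b <= 0xFA) {
      out += String.fromCharCode(0x05D0 + (b - 0xE0));
    } else {
      out += '\uFFFD'; // anything else is outside this sketch
    }
  }
  return out;
}

// Right: interpret the same bytes with the charset the page declares.
var decoded = decodeHebrewLetters(windows1255Bytes);

// Encoding the damaged string as UTF-8 and decoding it again, which is roughly
// what a string -> blob -> getDataAsString('UTF-8') round trip does, changes
// nothing: the original bytes are already gone.
var roundTripped = new TextDecoder('utf-8').decode(new TextEncoder().encode(misdecoded));

console.log(JSON.stringify(misdecoded), decoded, roundTripped === misdecoded);
```

The point of the sketch is that the charset has to be applied while the raw bytes are still available, not to the string that comes out of a wrong decode.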

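Use the variant of the .getContentText() method that takes a charset; it returns the content of the HTTP response decoded as a string using the given charset. In your case: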

var xml = UrlFetchApp.fetch(url).getContentText("Windows-1255");

That should be all you need to change. The blob() workaround is no longer required, although it is harmless. Other comments:

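  • The logical OR operator (||) is handy for setting default values. I've adjusted the first few lines so the function can still be tested without arguments while working normally when parameters are passed.

  • The way you set up an empty array before filling it with strings is poor JavaScript; the extra code isn't needed, so drop it. Instead, declare an empty array up front and push() rows onto it.

  • The number of .replace() calls can be reduced by using cleverer RegExps; I've included demo URLs for the particularly tricky ones.

  • There are \n newlines in the text, which I assume you don't need for your purposes, so I've added a replace() for those as well.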
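As a small illustration of these points, the sketch below isolates the || default, the push() pattern, and the attribute-quoting RegExp. The helper names buildUrl, pairUp and quoteFirstAttribute are hypothetical and exist only for this example; the RegExp is tried only on the simple single-attribute case shown.

```javascript
// Hypothetical helper: fall back to test values when an argument is missing.
// Note that || also replaces other falsy values such as '' or 0.
function buildUrl(book, chapter) {
  var bk = book || '04';
  var ch = chapter || '01';
  return 'http://www.mechon-mamre.org/p/pt/pt' + bk + ch + '.htm';
}

// Hypothetical helper: build a two-column table by pushing one row at a time,
// with no need to pre-allocate the outer or inner arrays.
function pairUp(hebrewVerses, englishVerses) {
  var rows = [];
  for (var i = 0; i < hebrewVerses.length; i++) {
    rows.push([hebrewVerses[i], englishVerses[i]]);
  }
  return rows;
}

// Hypothetical helper wrapping the attribute-quoting RegExp together with the
// newline removal.
function quoteFirstAttribute(html) {
  return html.replace(/(<.*?=)([^"']*?)([ >])/gi, '$1"$2"$3')
             .replace(/\n/g, '');
}

console.log(buildUrl());                         // uses both defaults
console.log(buildUrl('05', '12'));               // uses the arguments
console.log(pairUp(['a', 'b'], ['x', 'y']));     // two rows of two cells
console.log(quoteFirstAttribute('<P ALIGN=CENTER>\n'));
```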

Here's what you're left with:

function parseText(book, chapter) {
  var bk = book || '04'; //default value for testing purposes
  var ch = chapter || '01'; //default value for testing purposes
  var url = 'http://www.mechon-mamre.org/p/pt/pt' + bk + ch + '.htm';

  var xml = UrlFetchApp.fetch(url).getContentText("Windows-1255");

  //I had to "fix" these xml errors for XmlService.parse(xml) below
  //to function.
  xml = xml.replace(/(<!DOCTYPE.*EN")>/gi, '$1 "">')
           .replace(/(<(LINK|meta).*>)/gi,'$1</$2>')        // https://regex101.com/r/nH3pU8/1
           .replace(/(<.*?=)([^"']*?)([ >])/gi,'$1"$2"$3')  // https://regex101.com/r/eP7wO7/1
           .replace(/<BR>/gi, '<BR/>')
           .replace(/\n/g, '');

  //This section is the specific route to the table in the page I want
  var document = XmlService.parse(xml);
  var body = document.getRootElement().getChildren("BODY");
  var maintable = body[0].getChildren("TABLE");
  var maintablechildren = maintable[0].getChildren();

  //This is where the table gets parsed into the array
  var array = [];
  for (var i = 0; i < maintablechildren.length; i++) {
    var verse = maintablechildren[i].getChildren();

    //I originally tried verse[0].getText() but it didn't work.** It does now!
    var hebrew = verse[0].getText();
    //This array receives the English text and works fine.
    var english = verse[1].getText();
    array.push([hebrew,english]);
  }

  return array;
}

Result

 [
  [
    "  וַיְדַבֵּר יְהוָה אֶל-מֹשֶׁה בְּמִדְבַּר סִינַי, בְּאֹהֶל מוֹעֵד:  בְּאֶחָד לַחֹדֶשׁ הַשֵּׁנִי בַּשָּׁנָה הַשֵּׁנִית, לְצֵאתָם מֵאֶרֶץ מִצְרַיִם--לֵאמֹר.",
    " And the LORD spoke unto Moses in the wilderness of Sinai, in the tent of meeting, on the first day of the second month, in the second year after they were come out of the land of Egypt, saying:"
  ],
  [
    "  שְׂאוּ, אֶת-רֹאשׁ כָּל-עֲדַת בְּנֵי-יִשְׂרָאֵל, לְמִשְׁפְּחֹתָם, לְבֵית אֲבֹתָם--בְּמִסְפַּר שֵׁמוֹת, כָּל-זָכָר לְגֻלְגְּלֹתָם.",
    " 'Take ye the sum of all the congregation of the children of Israel, by their families, by their fathers' houses, according to the number of names, every male, by their polls;"
  ],
  [
    "  מִבֶּן עֶשְׂרִים שָׁנָה וָמַעְלָה, כָּל-יֹצֵא צָבָא בְּיִשְׂרָאֵל--תִּפְקְדוּ אֹתָם לְצִבְאֹתָם, אַתָּה וְאַהֲרֹן.",
    " from twenty years old and upward, all that are able to go forth to war in Israel: ye shall number them by their hosts, even thou and Aaron."
  ],
  ...
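
If the charset should not be hard-coded, one possible extension is to read it from the Content-Type header or from the page's meta tag before decoding. The sketch below is an assumption-laden illustration, not part of the original answer. detectCharset and fetchWithDeclaredCharset are hypothetical names, and it assumes that Apps Script accepts 'ISO-8859-1' as a charset name and that the declared charset name is one it recognizes.

```javascript
// Hypothetical helper: find a charset declaration in the Content-Type header
// first, then in the start of the HTML; fall back to UTF-8 if none is found.
function detectCharset(contentTypeHeader, htmlPrefix) {
  var pattern = /charset\s*=\s*["']?([\w-]+)/i;
  var match = pattern.exec(contentTypeHeader || '') || pattern.exec(htmlPrefix || '');
  return match ? match[1] : 'UTF-8';
}

// Hypothetical Apps Script wrapper (UrlFetchApp only exists in Apps Script).
function fetchWithDeclaredCharset(url) {
  var response = UrlFetchApp.fetch(url);
  var headers = response.getHeaders();
  // Header name casing may vary, so check both common spellings.
  var contentType = headers['Content-Type'] || headers['content-type'] || '';
  // Assumption: decoding as ISO-8859-1 maps each byte to one character, so the
  // ASCII-only meta tag stays readable whatever the real encoding is.
  var rawPrefix = response.getContentText('ISO-8859-1').slice(0, 2048);
  return response.getContentText(detectCharset(contentType, rawPrefix));
}
```

In parseText() this would replace the UrlFetchApp.fetch(url).getContentText("Windows-1255") call with fetchWithDeclaredCharset(url). For a single known site, the hard-coded charset in the answer is simpler and probably sufficient.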