Question

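I am trying to work with Chinese text and big data in Python. Part of the work is cleaning the text of unwanted data, and I am using regular expressions for that. However, I have run into some problems with Python regex and with the PyCharm application: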

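1) The data is stored in PostgreSQL and looks correct when viewed in the column, but after selecting it and pulling it into a variable it is displayed as squares.

The value printed to the console looks like this: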

Mentholatum曼秀雷敦男士深层活炭洁面乳100g（新包装）

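So I think there is no problem with the application's encoding, only with how the debugger displays the value. However, I have not found any way to fix this behavior.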

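2) An example of the regex I need to handle is removing everything between Chinese brackets, including the brackets themselves. The code I use is: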

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re
from pprint import pprint 
import sys, locale, os

    columnString = row[columnName]
    startFrom = valuestoremove["startsTo"]
    endWith = valuestoremove["endsAt"]
    isInclude = valuestoremove["include"]
    escapeCharsRegex = re.compile('([\.\^\$\*\+\?\(\)\[\{\|])')
    nonASCIIregex = re.compile('([^\x00-\x7F])')
    if escapeCharsRegex.match(startFrom):
        startFrom = re.escape(startFrom)
    if escapeCharsRegex.match(endWith):
        endWith = re.escape(endWith)

    if isInclude:
        regex = startFrom + '(.*)' + endWith
    else:
        regex = '(?<=' + startFrom + ').*?(?=' + endWith + ')'
    if nonASCIIregex.match(regex):
        p = re.compile(ur'' + regex)
    else:
        p = re.compile(regex)
    row[columnName] = p.sub("", columnString).strip()

But the regex has no effect on the given string. I tested it with the following code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

reg = re.compile(ur'（(.*)）')
string = u"巴黎欧莱雅 男士 劲能冰爽洁面啫哩（原男士劲能净爽洁面啫哩）100ml"
print string
string = reg.sub("", string)
print string

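This works fine for me. The only difference between the two code samples is that in the first one the regex values come from a txt file containing JSON, encoded as UTF-8: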

{
                "between": {
                    "startsTo": "(",
                    "endsAt": "）",
                    "include": true,
                    "sequenceID": "1"
                }
            }, {
                "between": {
                    "startsTo": "（",
                    "endsAt": ")",
                    "include": true,
                    "sequenceID": "2"
                }
            },{
                "between": {
                    "startsTo": "(",
                    "endsAt": ")",
                    "include": true,
                    "sequenceID": "2"
                }
            },{
                "between": {
                    "startsTo": "（",
                    "endsAt": "）",
                    "include": true,
                    "sequenceID": "2"
                }
            }

The Chinese brackets in the file are also displayed as squares.

I could not find an explanation or a solution for this behavior, so I need the community's help.

Thanks for your help.

Answer 1

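The problem is that the text you are reading is not being correctly interpreted as Unicode (this is one of the major problems that led to the sweeping changes in Python 3k). Instead of: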

data_file = myfile.read()

you need to tell it to decode the file:

data_file = myfile.read().decode("utf8")

Then carry on with json.loads and so on, and it should work fine. Alternatively:

data = json.load(myfile, "utf8")
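For readers who want to try this end to end, here is a minimal sketch of the same idea: decode the file as UTF-8 while reading it, so that the JSON values arrive as Unicode text rather than raw bytes. It uses `io.open` with an explicit `encoding` argument, which is my assumption and not part of the answer above; the temporary file and its contents are hypothetical and exist only to make the example self-contained.

```python
import io
import json
import os
import tempfile

# Hypothetical rules file, written here only so the example is self-contained.
# \uff08 and \uff09 are the fullwidth (Chinese) left and right parentheses.
rules_text = u'[{"between": {"startsTo": "\uff08", "endsAt": "\uff09", "include": true}}]'

path = os.path.join(tempfile.mkdtemp(), "rules.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(rules_text)

# Decode while reading: io.open returns text (Unicode), not raw bytes.
with io.open(path, "r", encoding="utf-8") as myfile:
    data = json.load(myfile)

starts_to = data[0]["between"]["startsTo"]
print(repr(starts_to))
```

If the printed value is a single Unicode character rather than a three-byte sequence, the delimiter was decoded correctly and can be combined with a Unicode regex pattern.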

Answer 2

After a lot of searching and consulting, here is the solution for Chinese text (both mixed-language and single-language):

import codecs
import re
def betweencase(valuestoremove, row, columnName):
    columnString = row[columnName]
    startFrom = valuestoremove["startsTo"]
    endWith = valuestoremove["endsAt"]
    isInclude = valuestoremove["include"]
    escapeCharsRegex = re.compile('([\.\^\$\*\+\?\(\)\[\{\|])')
    if escapeCharsRegex.match(startFrom):
        startFrom = re.escape(startFrom)
    if escapeCharsRegex.match(endWith):
        endWith = re.escape(endWith)
    if isInclude:
        regex = ur'' + startFrom + '(.*)' + endWith
    else:
        regex = ur'(?<=' + startFrom + ').*?(?=' + endWith + ')'

    # Encode the pattern as UTF-8 so it matches the values coming from the database
    p = re.compile(codecs.encode(unicode(regex), "utf-8"))
    delimiter = ' '
    if localization == 'CN':
        delimiter = ''

    row[columnName] = p.sub(delimiter, columnString).strip()

As you can see, we encode every regex as UTF-8, so that the PostgreSQL DB values match the regex.
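The code above is Python 2 and works by turning the pattern into UTF-8 bytes. As a rough sketch of the opposite approach, the version below keeps both the pattern and the text as Unicode, which is the default for `str` in Python 3. The function name `remove_between` and its parameters are hypothetical, and calling `re.escape` unconditionally is my simplification of the escaping check used above, not the author's method.

```python
# -*- coding: utf-8 -*-
import re


def remove_between(text, starts_to, ends_at, include=True):
    """Remove the span between two delimiters (hypothetical helper name)."""
    # re.escape can be applied unconditionally; escaped CJK delimiters
    # still match the literal characters.
    start = re.escape(starts_to)
    end = re.escape(ends_at)
    if include:
        # Remove the delimiters together with everything between them.
        pattern = start + u"(.*)" + end
    else:
        # Keep the delimiters and remove only what lies between them.
        pattern = u"(?<=" + start + u").*?(?=" + end + u")"
    return re.sub(pattern, u"", text).strip()


sample = u"巴黎欧莱雅 男士 劲能冰爽洁面啫哩（原男士劲能净爽洁面啫哩）100ml"
cleaned = remove_between(sample, u"（", u"）")
print(cleaned)
```

This only works if the database values are also Unicode text. If the driver returns byte strings, they would need to be decoded first, or the pattern encoded as in the answer above.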
