Question

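I open a text dump and then try to parse its contents. For now I only want to identify the different parts of the file (headings, labels, etc.) so that I can use them later. I identify lines by their first character. Some lines start with ¯ (macron) and some start with =.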

import sys

macron = '\xc2\xaf'  # UTF-8 bytes of the macron (a Python 2 byte string)
equalSign = '='
nullLines = 0

f = open(sys.argv[1])
for line in f:
    tempList = line.rsplit()
    if len(tempList) > 0:
        switchStr = tempList[0]
    else:
        print("tempList !> 0")
        nullLines = nullLines + 1
        continue  # nothing to classify on a blank line
    if switchStr[0:2] == macron:
        print("macron")
    elif switchStr[0] == equalSign:
        print('equals')
    else:
        print(switchStr)
print(nullLines)
f.close()

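This code works, but I'm confused. rsplit() splits on whitespace. If the file has a line like ===================, then tempList has length 1 and switchStr = '==================='. The same goes for the macron.

OK, so I tried to get the first character of each string with switchStr[0], but for the macron that didn't work: I needed the first "two" characters (though it is obviously just one), e.g. switchStr[0:2]. It did work for the equals sign. This interpreter output illustrates what I don't understand: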

>>> line = '¯¯¯¯¯¯¯¯¯¯'
>>> line
'\xc2\xaf\xc2\xaf\xc2\xaf\xc2\xaf\xc2\xaf\xc2\xaf\xc2\xaf\xc2\xaf\xc2\xaf\xc2\xaf'
>>> print line
¯¯¯¯¯¯¯¯¯¯
>>> line = '=========='
>>> line
'=========='
>>> print line
==========
>>>

So, some "characters" take 2 bytes and some take only one, but how can I tell the difference programmatically?

Answer 1

Easy.

Stop dealing with bytes.

import io
import sys

with io.open(sys.argv[1], encoding='utf-8') as f:
    line = f.readline()
    print(line[0])  # first character of the decoded line, not its first byte
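Applied to the loop in the question, the same idea might look roughly like the sketch below. It assumes the dump is UTF-8 encoded (the bytes \xc2\xaf shown in the question are the UTF-8 encoding of U+00AF); the names classify_line and classify_file are made up for illustration and are not part of the original code.

```python
import io

MACRON = u'\u00af'   # one character, although it is two bytes in UTF-8
EQUAL_SIGN = u'='


def classify_line(line):
    """Return 'blank', 'macron', 'equals' or 'other' for one decoded line."""
    stripped = line.strip()
    if not stripped:
        return 'blank'
    first = stripped[0]  # a whole character, not a single byte
    if first == MACRON:
        return 'macron'
    if first == EQUAL_SIGN:
        return 'equals'
    return 'other'


def classify_file(path, encoding='utf-8'):
    """Decode the file while reading it and classify every line."""
    with io.open(path, encoding=encoding) as f:
        return [classify_line(line) for line in f]


# The same character has different lengths as text and as UTF-8 bytes.
assert len(MACRON) == 1
assert len(MACRON.encode('utf-8')) == 2
```

Once the text is decoded, indexing with [0] behaves the same for the macron and the equals sign, so the [0:2] workaround should no longer be needed. If the file turns out to use a different encoding, the encoding argument would have to change accordingly.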

Iterating over a text file containing characters of different byte lengths
