I want to open a Pages document like this:
directory = "/path/to/file/"
with open(directory + "test.pages") as f:
    data = f.readlines()
for line in data:
    words = line.split()
    print(words)
Then I get this error:
IOError: [Errno 21] Is a directory: '/path/to/file/test.pages'
Why is this a directory? And how can I open it?
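Answer 0 (score: 1)
'/path/to/file/test.pages'
is a directory on the file system, so it cannot be opened as a file in Python. Your operating system is bundling multiple files in that directory and probably presenting it as a single package. You could conceivably walk the directory and get its contents: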
import os

# Print the full path of every file inside the package directory.
for root, dirs, files in os.walk('/path/to/file/test.pages'):
    for name in files:
        print(os.path.join(root, name))
But opening those files and trying to read their contents may prove fruitless.
Here is how you might try to find any plain text:
import os
import re

# Match any line containing at least one word character
# (a letter, a digit, or an underscore).
pattern = re.compile(r'.*\w+.*')
for root, dirs, files in os.walk('/path/to/file/test.pages'):
    for name in files:
        # Open each file with a context manager so it is closed automatically,
        # even if an error occurs. The universal newlines flag (U) applies to
        # Python 2; on Python 3 use plain 'r', where this is the default.
        with open(os.path.join(root, name), 'rU') as f:
            for line in f:
                if pattern.match(line):
                    print(line)
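As a minimal sketch of avoiding the IOError in the first place, the path can be checked before it is passed to open(). The helper names `describe_path` and `list_package_files` are hypothetical and only for illustration; they rely on nothing beyond `os.path.isdir`, `os.path.isfile`, and `os.walk`.

```python
import os


def describe_path(path):
    """Return 'directory', 'file', or 'missing' for the given path.

    Hypothetical helper for illustration only.
    """
    if os.path.isdir(path):
        return "directory"
    if os.path.isfile(path):
        return "file"
    return "missing"


def list_package_files(path):
    """Return the sorted paths of all regular files under a directory."""
    found = []
    for root, dirs, files in os.walk(path):
        for name in files:
            found.append(os.path.join(root, name))
    return sorted(found)
```

If `describe_path('/path/to/file/test.pages')` reports a directory, `list_package_files` shows which files inside it could be opened individually.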
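Answer 1 (score: 0)
I have a MacBook Pro running OS X 10.9.3.
I used your code and did not get the problem you cite. Since you are opening a .pages file, however, you will need to decode the file: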
File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 10: ordinal not in range(128)
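The traceback shows the default ASCII codec failing on byte 0xe2. One way to get past it, sketched here under the assumption that the file holds UTF-8 text (nothing in the thread confirms this, and binary parts of a .pages package will not decode meaningfully), is to pass an explicit encoding and replace undecodable bytes. The function name `read_text_lines` is hypothetical; `io.open` is used because it accepts `encoding` and `errors` on both Python 2 and Python 3.

```python
import io


def read_text_lines(path, encoding="utf-8"):
    """Read a file as text, replacing undecodable bytes instead of raising.

    Assumption: the file is UTF-8 text. This is a guess for illustration,
    not something the .pages format guarantees.
    """
    with io.open(path, "r", encoding=encoding, errors="replace") as f:
        # Strip the trailing newline from each line.
        return [line.rstrip("\n") for line in f]
```

Bytes that are not valid in the chosen encoding come back as the Unicode replacement character (U+FFFD), so the read finishes without a UnicodeDecodeError.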