Question

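Parts of the code below come from another example. I modified it a bit so that it reads an HTML file and outputs the contents to a spreadsheet.

Since it is just a local file, using Selenium is probably overkill, but I want to learn from this example.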

from selenium import webdriver
import lxml.html as LH
import lxml.html.clean as clean
import xlwt

book = xlwt.Workbook(encoding='utf-8', style_compression = 0)
sheet = book.add_sheet('SeaWeb', cell_overwrite_ok = True)

driver = webdriver.PhantomJS()
ignore_tags=('script','noscript','style')

results = []

driver.get("source_file.html")
content = driver.page_source
cleaner = clean.Cleaner()
content = cleaner.clean_html(content)
doc = LH.fromstring(content)

for elt in doc.iterdescendants():
    if elt.tag in ignore_tags: continue
    text = elt.text or ''                                 #question 1
    tail = elt.tail or ''                                 #question 1
    words = ''.join((text,tail)).strip()
    if words:                                   # extra question
        words = words.encode('utf-8')                     #question 2
        results.append(words)                             #question 3
        results.append('; ')                              #question 3

sheet.write (0, 0, results)

book.save("C:\\ source_output.xls")

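1. In the lines text = elt.text or '' and tail = elt.tail or '': why do both .text and .tail hold text? And why is the or '' part important here?
2. The text in the HTML file contains special characters such as ° (temperature). .encode('utf-8') does not produce correct output, either in IDLE or in the Excel spreadsheet. What is the alternative?
3. Is it possible to join the output into a string instead of a list? Right now, to append to the list I have to call .append twice: once for the text and once for the ;.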

Answer 1

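elt is an HTML node. It contains certain attributes and text parts. lxml lets you extract the text with .text or .tail, depending on where the text sits (attributes are available through .attrib).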

<a attribute1='abc'>
    some text     ----> .text of <a> gets this
    <p attributeP='def'> </p>
    some tail     ----> .tail of <p> gets this (the text right after </p>)
</a>

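The idea behind or '' is that if the current HTML node has no text / tail, the attribute is None. When we later try to concatenate or join a None value, Python complains with a TypeError. So, to avoid that error, an empty string '' is used whenever text / tail is None.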
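The diagram and the or '' explanation above can be checked with a minimal sketch. To keep it runnable without extra installs, it uses the standard library's xml.etree.ElementTree instead of lxml; both share the same .text / .tail model. The snippet string and the helper name node_words are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet mirroring the diagram above.
root = ET.fromstring(
    "<a attribute1='abc'>some text<p attributeP='def'></p>some tail</a>"
)
p = root.find("p")

print(repr(root.text))  # 'some text'
print(repr(p.text))     # None, because <p> is empty
print(repr(p.tail))     # 'some tail', the text right after </p>
print(repr(root.tail))  # None, nothing follows </a>

# Without `or ''`, joining a None value raises TypeError.
try:
    ''.join((p.text, p.tail))
except TypeError as exc:
    print("without `or ''`:", exc)


def node_words(elt):
    """Return the element's text plus tail, treating None as ''."""
    text = elt.text or ''
    tail = elt.tail or ''
    return ''.join((text, tail)).strip()


# root.iter() yields <a> itself and then its descendant <p>.
words = [node_words(e) for e in root.iter()]
print(words)
```

Note that 'some tail' is reported for the <p> element, not for <a>, which is why the loop in the question has to read both attributes on every node.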

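The degree sign is a one-character unicode string, but when you call .encode('utf-8') it becomes a 2-byte UTF-8 byte string, \xc2\xb0. When those two bytes are displayed as if they were Latin-1, they show up as Â° (and encoding that again gives \xc3\x82\xc2\xb0). So basically you do not have to encode the ° character yourself; Python handles the unicode string correctly. If it does not, add the correct encoding declaration (a magic comment, not a shebang) at the top of your Python script. See PEP 263.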

# -*- coding: UTF-8 -*-
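The byte values mentioned above can be reproduced with a short sketch. It only uses built-in string methods; the variable names are illustrative, and Latin-1 is assumed as the encoding of whatever viewer shows the garbled text.

```python
# -*- coding: utf-8 -*-
degree = u'\u00b0'                    # DEGREE SIGN, a single code point

encoded = degree.encode('utf-8')      # two bytes: 0xC2 0xB0
mojibake = encoded.decode('latin-1')  # those bytes read as Latin-1
double = mojibake.encode('utf-8')     # encoding the garbled text again

print(len(degree), len(encoded))      # 1 2
print(repr(encoded))
print(repr(double))
```

Keeping the text as unicode all the way to the spreadsheet avoids the intermediate byte string, which is where the Â° comes from.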

Yes, you can also collect the output in a string. Just use +, because string types have no append.

results = ''
results = results + 'whatever you want to join'

Or you can keep the list and merge your 2 lines into one:

results.append(words + '; ')

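Note: I just checked the xlwt documentation, and sheet.write() expects a single cell value such as a string. So basically you should not pass results, which is a list, directly; turn it into one string first.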
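One way to get a single string for the cell is to keep appending to the list and join it once at the end. This is a minimal sketch: the sample words are invented, and the xlwt call is left as a comment so the snippet runs without xlwt installed.

```python
# Hypothetical words, standing in for what the loop in the question collects.
words_found = [u'Sea temperature', u'21 \u00b0C', u'Wind 5 kn']

results = []
for words in words_found:
    results.append(words)  # one append per item, no separate '; ' entry

# Join once at the end; the separator goes between items only.
cell_text = u'; '.join(results)
print(cell_text)

# With the workbook from the question, the string could then be written as:
# sheet.write(0, 0, cell_text)
```

Unlike appending '; ' after every item, join leaves no trailing separator at the end of the cell.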

Answer 2

A simple example for Q1

from lxml import etree

test = etree.XML("<main>placeholder</main>")
print(test.text)        # prints placeholder
print(test.tail)        # prints None
print(test.tail or '')  # prints an empty string

test.text = "texter"
# encoding='unicode' makes tostring() return text instead of bytes
print(etree.tostring(test, encoding='unicode'))  # prints <main>texter</main>

test.tail = "tailer"
print(etree.tostring(test, encoding='unicode'))  # prints <main>texter</main>tailer

Python - Output the content of an HTML file to a spreadsheet
