Scrapy cannot handle the "<" character

Time: 2019-11-07 13:59:15

Tags: scrapy lxml parsel

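I am trying to extract text that contains "<" (the less-than character). On my local machine everything works fine, but on the server the text is truncated at the "<".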

1) hipoksemia tętnicza (PaO<sub>2</sub>/FiO<sub>2</sub> < 300 )

So I receive:

1) hipoksemia t\u0119tnicza (PaO<sub>2</sub>/FiO<sub>2</sub>

Scraping the ">" character causes no problems. Thanks for your help.

1 answer:

Answer 0 (score: 0)

A bare "<" is not valid HTML. It should be written as "&lt;".

Scrapy uses Parsel to parse XML/HTML responses, and Parsel in turn uses lxml to parse XML/HTML documents. lxml does not handle broken HTML the way web browsers and some other parsers do.

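Parsel has an open issue about handling these cases. Fixing it may require Parsel to support an alternative to lxml, which is not trivial to implement, so it may take a while before the issue is resolved.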
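
In the meantime, one possible workaround is to escape stray "<" characters before the markup reaches the parser. The sketch below is not part of Scrapy or Parsel; the helper name escape_stray_lt and the regular expression are assumptions made for illustration. It treats a "<" as stray only when it is not followed by a letter, "/", "!" or "?", so text such as "a<b" (no space after the "<") would still look like a tag and would not be fixed.

```python
import re

# Hypothetical helper, not a Scrapy/Parsel API.
# Matches a "<" that cannot start a tag, closing tag, comment,
# doctype or processing instruction.
STRAY_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_stray_lt(html: str) -> str:
    """Replace stray "<" characters with the "&lt;" entity."""
    return STRAY_LT.sub("&lt;", html)


sample = "1) hipoksemia (PaO<sub>2</sub>/FiO<sub>2</sub> < 300 )"
fixed = escape_stray_lt(sample)
print(fixed)
```

The cleaned string can then be parsed as usual, for example with parsel.Selector(text=fixed), instead of parsing the raw response text directly.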