Using the etree.tostring function in the lxml module

The Python demo code is as follows:

from lxml import etree
html_str = ''' <div> <ul> 
        <li class="item-1"><a href="link1.html">first item</a></li> 
        <li class="item-1"><a href="link2.html">second item</a></li> 
        <li class="item-inactive"><a href="link3.html">third item</a></li> 
        <li class="item-1"><a href="link4.html">fourth item</a></li> 
        <li class="item-0"><a href="link5.html">fifth item</a> 
        </ul> </div> '''

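# Parse the string; lxml's HTML parser repairs missing or unclosed tags
html = etree.HTML(html_str)

# Serialize the Element back to bytes, then decode the bytes to str
handled_html_str = etree.tostring(html).decode()
print(handled_html_str)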

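Compared with the original string, the printed result differs in two ways:

1. The closing </li> tag that was missing from the last list item is added automatically

2. Wrapper tags such as <html> and <body> are added automatically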

<html><body><div> <ul> 
<li class="item-1"><a href="link1.html">first item</a></li> 
<li class="item-1"><a href="link2.html">second item</a></li> 
<li class="item-inactive"><a href="link3.html">third item</a></li> 
<li class="item-1"><a href="link4.html">fourth item</a></li> 
<li class="item-0"><a href="link5.html">fifth item</a> 
</li></ul> </div> </body></html>
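
The two differences can also be checked in code instead of by eye. The sketch below is only an illustration: it uses a shortened two-item version of the list, and the variable names (repaired, closing_li_before, and so on) are made up for this example.

```python
from lxml import etree

# Shortened input: the second <li> has no closing tag
html_str = '''<div><ul>
<li class="item-1"><a href="link1.html">first item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul></div>'''

# Parse, then serialize straight to str
repaired = etree.tostring(etree.HTML(html_str), encoding="unicode")

# Difference 1: the missing </li> has been added
closing_li_before = html_str.count("</li>")
closing_li_after = repaired.count("</li>")

# Difference 2: the <html> and <body> wrappers have been added
has_wrappers = repaired.startswith("<html><body>")

print(closing_li_before, closing_li_after, has_wrappers)
```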

Conclusions:

lxml.etree.HTML(html_str) parses the string into an Element object and automatically completes missing tags.

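lxml.etree.tostring serializes an Element object back into an HTML string. It returns bytes by default, which is why the demo code calls .decode() on the result.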
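
As a small sketch of the return types: besides the default bytes output, tostring accepts encoding="unicode" to return a str directly, and method="html" to serialize with HTML rules rather than the default XML rules. The tiny input string and the variable names here are illustrative only.

```python
from lxml import etree

# Both <li> elements are left unclosed on purpose
root = etree.HTML("<ul><li>one<li>two</ul>")

as_bytes = etree.tostring(root)                     # bytes by default
as_text = etree.tostring(root, encoding="unicode")  # str, no .decode() needed
as_html = etree.tostring(root, method="html", encoding="unicode")  # HTML serializer

print(type(as_bytes).__name__, type(as_text).__name__)
print(as_text)
```

Passing encoding="unicode" avoids the extra .decode() step when the result is only going to be printed or inspected.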

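When a crawler uses lxml to extract data, its extraction rules should be written against the output of lxml.etree.tostring rather than the raw page source, because the repaired tree is what lxml actually queries.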
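
A minimal sketch of that workflow, using a shortened version of the list: print the serialized tree first, then write XPath expressions that match what it shows. The XPath expressions and variable names are illustrative choices, not the only way to query the tree.

```python
from lxml import etree

html_str = '''<div><ul>
<li class="item-1"><a href="link1.html">first item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul></div>'''

root = etree.HTML(html_str)

# Inspect the structure lxml actually holds before writing any XPath
print(etree.tostring(root, encoding="unicode"))

# /html/body exists only in the repaired tree, not in the raw string
hrefs = root.xpath("/html/body/div/ul/li/a/@href")

# The unclosed <li> was repaired, so its link is reachable as well
texts = root.xpath("//li[@class='item-0']/a/text()")

print(hrefs)
print(texts)
```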