I have HTML files in the subdirectory the_files, each with content like this:
<div class='log'>start</div>
<div class='ts'>2017-03-14 09:17:52.859 +0800 </div><div class='log'>bla bla bla</div>
<div class='ts'>2017-03-14 09:17:55.619 +0800 </div><div class='log'>aba aba aba</div>
...
...
I want to extract the string inside each tag and print it on the terminal like this:
2017-03-14 09:17:52.859 +0800 , bla bla bla
2017-03-14 09:17:55.619 +0800 , aba aba aba
...
...
I want to ignore the first line, <div class='log'>start</div>.
My code so far:
import os
from bs4 import BeautifulSoup

path = "the_files/"

def do_task_html():
    dir_path = os.listdir(path)
    for file in dir_path:
        if file.endswith(".html"):
            soup = BeautifulSoup(open(path+file))
            # Collects the text of every "ts" div in the file, then joins it all
            item1 = [element.text for element in soup.find_all("div", "ts")]
            string1 = ''.join(item1)
            # Collects the text of every "log" div, including the first "start" one
            item2 = [element.text for element in soup.find_all("div", "log")]
            string2 = ''.join(item2)
            print string1 + "," + string2
This code produces the following result:
2017-03-14 09:17:52.859 +0800 2017-03-14 09:17:55.619 +0800 , start bla bla bla aba aba aba ... ...
Is there a way to fix this?
Thanks for your help.
Answer 0 (score: 2)
Find each div with class ts, then take its text together with the text of its next_sibling.
for div in soup.find_all("div", class_="ts"):
    # Format first, then print, so the line works on both Python 2 and Python 3
    print("%s, %s" % (div.get_text(strip=True), div.next_sibling.get_text(strip=True)))
Output:
2017-03-14 09:17:52.859 +0800, bla bla bla
2017-03-14 09:17:55.619 +0800, aba aba aba
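One caveat with next_sibling: if the ts and log divs are separated by a newline or other whitespace, the next sibling is a text node rather than the log div. A minimal sketch that sidesteps this, assuming the structure shown in the question, could use find_next_sibling instead. The helper name extract_pairs and the inline sample string are made up for illustration, and the "html.parser" backend is my choice, not something the question specifies.

```python
from bs4 import BeautifulSoup


def extract_pairs(html):
    """Return (timestamp, message) tuples for each ts div directly followed by a log div."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for ts_div in soup.find_all("div", class_="ts"):
        # find_next_sibling only considers tags, so whitespace text nodes are skipped
        nxt = ts_div.find_next_sibling("div")
        # Pair only when the very next div is a log entry
        if nxt is None or "log" not in nxt.get("class", []):
            continue
        pairs.append((ts_div.get_text(strip=True), nxt.get_text(strip=True)))
    return pairs


# Hypothetical sample mirroring the question's markup
sample = """<div class='log'>start</div>
<div class='ts'>2017-03-14 09:17:52.859 +0800 </div><div class='log'>bla bla bla</div>
<div class='ts'>2017-03-14 09:17:55.619 +0800 </div><div class='log'>aba aba aba</div>
"""

for ts, msg in extract_pairs(sample):
    print("%s, %s" % (ts, msg))
```

Because the loop starts from the ts divs, the leading <div class='log'>start</div> has no timestamp in front of it and is never printed. To run this over the directory, you could read each .html file in the_files and pass its contents to extract_pairs.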