This isn't pretty code, but I have some code that grabs a series of strings out of an HTML file and gives me a series of strings: author, title, date, length, and text. I have 2000+ HTML files and I want to go through all of them and write this data to a single CSV file. I know all of this will eventually have to be wrapped in a for loop, but before that I'm having a hard time understanding how to go from getting these values to writing them to a CSV file. My thinking was to first create a list or tuple and then write that to a line in the CSV file:
the_file = "/Users/john/Code/tedtalks/test/transcript?language=en.0"
holding = soup(open(the_file).read(), "lxml")
at = holding.find("title").text
author = at[0:at.find(':')]
title = at[at.find(":")+1 : at.find("|")]
date = re.sub('[^a-zA-Z0-9]', ' ', holding.select_one("span.meta__val").text)
length_data = holding.find_all('data', {'class': 'talk-transcript__para__time'})
(m, s) = ([x.get_text().strip("\n\r")
           for x in length_data if re.search(r"(?s)\d{2}:\d{2}",
           x.get_text().strip("\n\r"))][-1]).split(':')
length = int(m) * 60 + int(s)
firstpass = re.sub(r'\([^)]*\)', '', holding.find('div', class_='talk-transcript__body').text)
text = re.sub('[^a-zA-Z\.\']', ' ', firstpass)
data = ([author].join() + [title] + [date] + [length] + [text])
with open("./output.csv", "w") as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    for line in data:
        writer.writerow(line)
I can't for the life of me figure out how to get Python to respect the fact that these are strings and should be stored as strings, not as lists of letters. (The `data = ...` and `writer.writerow(...)` lines above are my attempt to work that out.)
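For reference, a minimal sketch of the row-writing step in question (the values below are hypothetical stand-ins for the parsed fields): `writer.writerow` takes one sequence per row, so passing the variables together in a single list makes each value one CSV field instead of iterating over its characters.

```python
import csv

# Hypothetical values standing in for the parsed fields above.
author, title, date, length, text = "Jane Doe", "A Talk", "Jun 2016", 754, "Hello world."

with open("output.csv", "w", newline="") as csv_file:
    writer = csv.writer(csv_file)
    # One list per row: each element becomes exactly one CSV field.
    writer.writerow([author, title, date, length, text])
```

Passing a bare string to `writerow` is what produces the one-letter-per-column behavior, since a string is itself a sequence of characters.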
Looking ahead: is it better/more efficient to process the 2000 files this way, stripping them down to what I want and writing the CSV one line at a time, or is it better to build a data frame in pandas and then write that out to CSV? (All 2000 files = 160MB, and stripped down the final data can't be more than 100MB, so nothing huge here, but looking forward, size may eventually become an issue.)
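As a rough sketch of the pandas alternative mentioned above (assuming a per-file parser that returns the five fields; the rows shown are hypothetical), one can accumulate the rows in a plain list and build the frame once at the end, which avoids the cost of growing a DataFrame incrementally:

```python
import pandas as pd

# Hypothetical rows, as a per-file parser might return them.
rows = [
    ("Jane Doe", "A Talk", "Jun 2016", 754, "Hello world."),
    ("John Roe", "Another Talk", "Jul 2016", 1020, "Goodbye."),
]

# Build the frame once, then write it without the index column.
df = pd.DataFrame(rows, columns=["author", "title", "date", "length", "text"])
df.to_csv("output_pandas.csv", index=False)
```

At ~100MB of final data either approach fits comfortably in memory; the row-at-a-time csv.writer route has the smaller footprint, while pandas is convenient if further analysis on the frame is planned anyway.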
Answer 0 (score: 1)
This will grab all the files and put the data into a csv; you just need to pass the path to the folder containing the html files and the name of your output file:
import re
import csv
import os
from bs4 import BeautifulSoup
from glob import iglob

def parse(soup):
    # both title and author can be parsed from separate tags.
    author = soup.select_one("h4.h12.talk-link__speaker").text
    title = soup.select_one("h4.h9.m5").text
    # just need to strip the text from the date string, no regex needed.
    date = soup.select_one("span.meta__val").text.strip()
    # we want the last time, which is the talk-transcript__para__time previous to the footer.
    mn, sec = map(int, soup.select_one("footer.footer").find_previous("data", {
        "class": "talk-transcript__para__time"}).text.split(":"))
    length = (mn * 60 + sec)
    # to ignore times etc. we can just pull from the actual text fragments and remove noise, i.e. (Applause).
    text = re.sub(r'\([^)]*\)', "", " ".join(d.text for d in soup.select("span.talk-transcript__fragment")))
    return author.strip(), title.strip(), date, length, re.sub('[^a-zA-Z\.\']', ' ', text)

def to_csv(patt, out):
    # open the file to write to.
    with open(out, "w") as out:
        # create the csv.writer.
        wr = csv.writer(out)
        # write our headers.
        wr.writerow(["author", "title", "date", "length", "text"])
        # get all our html files.
        for html in iglob(patt):
            with open(html) as f:
                # parse the file and write the data to a row.
                wr.writerow(parse(BeautifulSoup(f, "lxml")))

to_csv("./test/*.html", "output.csv")
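One small note on the writer setup above (a sketch, not part of the original answer): on Python 3 the csv docs recommend opening the output file with newline="" so no extra blank lines appear between rows on Windows, and an explicit encoding avoids surprises with non-ASCII characters in the transcripts:

```python
import csv

# Variant of the header-writing step above: newline="" prevents extra blank
# lines on Windows, and utf-8 handles any non-ASCII transcript text.
with open("output_headers.csv", "w", newline="", encoding="utf-8") as out:
    wr = csv.writer(out)
    wr.writerow(["author", "title", "date", "length", "text"])
```

The same two arguments can be added to the `open(out, "w")` call in `to_csv` without changing anything else.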