This isn't pretty code, but I have some code that grabs a series of strings out of an HTML file and gives me a series of strings: author, title, date, length, and text. I have 2000+ HTML files and I want to go through all of them and write this data to a single CSV file. I know all of this will eventually have to be wrapped in a for loop, but before that I'm having a hard time understanding how to go from getting these values to writing them to a CSV file. My thinking was to first create a list or tuple and then write that to a line in the CSV file:
the_file = "/Users/john/Code/tedtalks/test/transcript?language=en.0"
holding = soup(open(the_file).read(), "lxml")
at = holding.find("title").text
author = at[0:at.find(':')]
title = at[at.find(":")+1 : at.find("|")]
date = re.sub('[^a-zA-Z0-9]', ' ', holding.select_one("span.meta__val").text)
length_data = holding.find_all('data', {'class': 'talk-transcript__para__time'})
(m, s) = ([x.get_text().strip("\n\r")
           for x in length_data if re.search(r"(?s)\d{2}:\d{2}",
           x.get_text().strip("\n\r"))][-1]).split(':')
length = int(m) * 60 + int(s)
firstpass = re.sub(r'\([^)]*\)', '', holding.find('div', class_='talk-transcript__body').text)
text = re.sub('[^a-zA-Z\.\']', ' ', firstpass)
data = ([author].join() + [title] + [date] + [length] + [text])
with open("./output.csv", "w") as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    for line in data:
        writer.writerow(line)
I can't for the life of me figure out how to get Python to respect the fact that these are strings and should be stored as strings, not as lists of letters. (The `data = ...` and `writer.writerow(...)` lines above are my attempt to work that out.)
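For reference, a minimal sketch of the row-writing step in question (the values below are hypothetical stand-ins for the parsed fields): `writer.writerow` takes one sequence per row, so passing the variables together in a single list makes each value one CSV field instead of iterating over its characters.

```python
import csv

# Hypothetical values standing in for the parsed fields above.
author, title, date, length, text = "Jane Doe", "A Talk", "Jun 2016", 754, "Hello world."

with open("output.csv", "w", newline="") as csv_file:
    writer = csv.writer(csv_file)
    # One list per row: each element becomes exactly one CSV field.
    writer.writerow([author, title, date, length, text])
```

Passing a bare string to `writerow` is what produces the one-letter-per-column behavior, since a string is itself a sequence of characters.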
Looking ahead: is it better/more efficient to process the 2000 files this way, stripping them down to what I want and writing the CSV one line at a time, or is it better to build a data frame in pandas and then write that out to CSV? (All 2000 files = 160MB, and stripped down the final data can't be more than 100MB, so nothing huge here, but looking forward, size may eventually become an issue.)
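As a rough sketch of the pandas alternative mentioned above (assuming a per-file parser that returns the five fields; the rows shown are hypothetical), one can accumulate the rows in a plain list and build the frame once at the end, which avoids the cost of growing a DataFrame incrementally:

```python
import pandas as pd

# Hypothetical rows, as a per-file parser might return them.
rows = [
    ("Jane Doe", "A Talk", "Jun 2016", 754, "Hello world."),
    ("John Roe", "Another Talk", "Jul 2016", 1020, "Goodbye."),
]

# Build the frame once, then write it without the index column.
df = pd.DataFrame(rows, columns=["author", "title", "date", "length", "text"])
df.to_csv("output_pandas.csv", index=False)
```

At ~100MB of final data either approach fits comfortably in memory; the row-at-a-time csv.writer route has the smaller footprint, while pandas is convenient if further analysis on the frame is planned anyway.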
Answer 0 (score: 1)
This will grab all the files and put the data into a csv; you just need to pass the path to the folder containing the html files and the name of your output file:
import re
import csv
import os
from bs4 import BeautifulSoup
from glob import iglob

def parse(soup):
    # both title and author can be parsed from separate tags.
    author = soup.select_one("h4.h12.talk-link__speaker").text
    title = soup.select_one("h4.h9.m5").text
    # just need to strip the text from the date string, no regex needed.
    date = soup.select_one("span.meta__val").text.strip()
    # we want the last time, which is the talk-transcript__para__time previous to the footer.
    mn, sec = map(int, soup.select_one("footer.footer").find_previous("data", {
        "class": "talk-transcript__para__time"}).text.split(":"))
    length = (mn * 60 + sec)
    # to ignore times etc. we can just pull from the actual text fragments and remove noise, i.e. (Applause).
    text = re.sub(r'\([^)]*\)', "", " ".join(d.text for d in soup.select("span.talk-transcript__fragment")))
    return author.strip(), title.strip(), date, length, re.sub('[^a-zA-Z\.\']', ' ', text)

def to_csv(patt, out):
    # open the file to write to.
    with open(out, "w") as out:
        # create the csv.writer.
        wr = csv.writer(out)
        # write our headers.
        wr.writerow(["author", "title", "date", "length", "text"])
        # get all our html files.
        for html in iglob(patt):
            with open(html) as f:
                # parse the file and write the data to a row.
                wr.writerow(parse(BeautifulSoup(f, "lxml")))

to_csv("./test/*.html", "output.csv")
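One small note on the writer setup above (a sketch, not part of the original answer): on Python 3 the csv docs recommend opening the output file with newline="" so no extra blank lines appear between rows on Windows, and an explicit encoding avoids surprises with non-ASCII characters in the transcripts:

```python
import csv

# Variant of the header-writing step above: newline="" prevents extra blank
# lines on Windows, and utf-8 handles any non-ASCII transcript text.
with open("output_headers.csv", "w", newline="", encoding="utf-8") as out:
    wr = csv.writer(out)
    wr.writerow(["author", "title", "date", "length", "text"])
```

The same two arguments can be added to the `open(out, "w")` call in `to_csv` without changing anything else.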