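I'm working on a text summarization task and trying to add a .csv dataset to tensorflow_datasets
(this is required to run a pre-trained transformer). I'm following this tutorial, https://www.tensorflow.org/datasets/add_dataset, but I still can't figure out how to add it.
This is what I have so far: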
import tensorflow_datasets.public_api as tfds

# TODO(data.csv): BibTeX citation
_CITATION = """
"""

_HOMEPAGE = "https:..."

# TODO(data.csv):
_DESCRIPTION = """A textual corpus of ...
"""

_DOCUMENT = "text"
_SUMMARY = "summary"

manual_dir = './'


class new_dataset(tfds.core.GeneratorBasedBuilder):
    """TODO(data.csv): Short description of my dataset."""

    # TODO(data.csv): Set up version.
    VERSION = tfds.core.Version('0.1.0')

    def _info(self):
        return tfds.core.DatasetInfo(
            builder=self,
            description=_DESCRIPTION,
            features=tfds.features.FeaturesDict({
                _DOCUMENT: tfds.features.Text(),
                _SUMMARY: tfds.features.Text()
            }),
            supervised_keys=(_DOCUMENT, _SUMMARY),
            homepage="https://...",
            citation=_CITATION,
        )

    def _split_generators(self, dl_manager):
        """Returns SplitGenerators."""
        # TODO(data.csv): Downloads the data and defines the splits
        # dl_manager is a tfds.download.DownloadManager that can be used to
        # download and extract URLs
        return [
            tfds.core.SplitGenerator(
                name=tfds.Split.TRAIN,
                # These kwargs will be passed to _generate_examples
                gen_kwargs={},
            ),
        ]

    def _generate_examples(self):
        # Yields examples from the dataset
        yield 'key', {}
If my dataset is a .csv file with two columns, "text" and "summary", how do I correctly define _split_generators
and _generate_examples?
This Python file and my dataset are in the same directory.
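
To make the question concrete, here is a minimal sketch of the CSV-reading part on its own, using only the standard library so it can be checked without tensorflow_datasets. It assumes the file has a header row with the columns "text" and "summary". The function name generate_examples and the file name data.csv are hypothetical, and this is not a confirmed solution.

```python
import csv
import os
import tempfile


def generate_examples(path, document_key="text", summary_key="summary"):
    """Yield (key, example) pairs from a CSV file that has a header row.

    Hypothetical helper: the name and parameters are assumptions,
    not part of the tensorflow_datasets API.
    """
    # newline="" lets the csv module handle quoted fields with embedded newlines.
    with open(path, newline="", encoding="utf-8") as f:
        for index, row in enumerate(csv.DictReader(f)):
            # The row index serves as a unique key for each example.
            yield index, {
                document_key: row[document_key],
                summary_key: row[summary_key],
            }


# Small self-check against a throwaway CSV file (assumed name: data.csv).
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "data.csv")
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "summary"])
        writer.writerow(["A long document, with a comma.", "Short."])
        writer.writerow(["Another document.", "Another summary."])
    examples = list(generate_examples(csv_path))
```

Since the kwargs in gen_kwargs are passed to _generate_examples, my guess is that _split_generators should pass something like gen_kwargs={"path": ...} built from manual_dir, and that _generate_examples(self, path) should then yield the same (key, example) pairs as the helper above. Is that the right way to wire it up?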