Question

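I have a dataframe with two columns (review and sentiment). I am using PyTorch and the torchtext library to preprocess the data. Is it possible to use a dataframe as the source when reading data in torchtext? I am looking for something similar to, but not the same as, the following:

data.TabularDataset.splits(path='./data')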

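I have already performed some operations on the data (cleaning it and converting it to the required format), and the final data is in a dataframe.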

If not torchtext, what other package would you suggest for preprocessing text data held in a dataframe? I could not find anything online. Any help would be great.

Answer 1

Adapt the Dataset and Example classes from torchtext.data:

# These names live in torchtext.data up to torchtext 0.8;
# in 0.9 to 0.11 they moved to torchtext.legacy.data.
from torchtext.data import Field, Dataset, Example
import pandas as pd


class DataFrameDataset(Dataset):
    """Class for using pandas DataFrames as a datasource"""

    def __init__(self, examples, fields, filter_pred=None):
        """
        Create a dataset from a pandas dataframe of examples and Fields

        Arguments:
            examples pd.DataFrame: DataFrame of examples
            fields {str: Field}: The Fields to use in this tuple. The
                string is a field name, and the Field is the associated field.
            filter_pred (callable or None): use only examples for which
                filter_pred(example) is true, or use all examples if None.
                Default is None
        """
        # Convert every row (a pandas Series) into an Example.
        self.examples = examples.apply(
            SeriesExample.fromSeries, args=(fields,), axis=1).tolist()
        if filter_pred is not None:
            # list() keeps the result reusable; a bare filter() is a
            # one-shot iterator in Python 3.
            self.examples = list(filter(filter_pred, self.examples))
        self.fields = dict(fields)
        # Unpack field tuples
        for n, f in list(self.fields.items()):
            if isinstance(n, tuple):
                self.fields.update(zip(n, f))
                del self.fields[n]


class SeriesExample(Example):
    """Class to convert a pandas Series to an Example"""

    @classmethod
    def fromSeries(cls, data, fields):
        return cls.fromdict(data.to_dict(), fields)

    @classmethod
    def fromdict(cls, data, fields):
        ex = cls()
        for key, field in fields.items():
            if key not in data:
                raise ValueError("Specified key {} was not found in "
                                 "the input data".format(key))
            if field is not None:
                # Run the Field's preprocessing (e.g. tokenization).
                setattr(ex, key, field.preprocess(data[key]))
            else:
                # No Field given: keep the raw value.
                setattr(ex, key, data[key])
        return ex

Then, if you have two dataframes handy, train_df and valid_df, just load them into Dataset objects by passing each one, together with the fields dict, to DataFrameDataset.
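For anyone who wants to see what the row-to-Example conversion does without installing torchtext or pandas, here is a minimal sketch. StubField, SimpleExample and example_from_row are hypothetical stand-ins invented for this illustration, and the sample rows are made up; only the control flow is meant to mirror SeriesExample.fromdict and the filter_pred handling above.

```python
class StubField:
    """Hypothetical stand-in for a torchtext Field: lowercase, split on whitespace."""

    def preprocess(self, value):
        return value.lower().split()


class SimpleExample:
    """Hypothetical stand-in for torchtext's Example: a bare attribute holder."""


def example_from_row(row, fields):
    """Same logic as SeriesExample.fromdict: one attribute per field name."""
    ex = SimpleExample()
    for key, field in fields.items():
        if key not in row:
            raise ValueError(
                "Specified key {} was not found in the input data".format(key))
        # A field of None means "keep the raw value".
        value = row[key] if field is None else field.preprocess(row[key])
        setattr(ex, key, value)
    return ex


# Each dict plays the role of one dataframe row after Series.to_dict().
rows = [
    {"review": "Great movie", "sentiment": "pos"},
    {"review": "Too long and dull", "sentiment": "neg"},
]
fields = {"review": StubField(), "sentiment": None}

examples = [example_from_row(row, fields) for row in rows]

# Optional filtering, in the same spirit as filter_pred.
short_examples = list(filter(lambda ex: len(ex.review) <= 2, examples))

print(examples[0].review, examples[0].sentiment)
print(len(short_examples))
```

With the real classes, the equivalent step would presumably be DataFrameDataset(train_df, fields), where fields maps each column name to a Field object (or to None to keep the column unprocessed).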

Answer 2

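Thanks, Geoffrey.

Looking at the source code for torchtext.data.field

https://pytorch.org/text/_modules/torchtext/data/field.html

it looks like the "train" parameter needs to be either a Dataset already, or some iterable source of text data. But given that we have not created a Dataset at this point, I am guessing you just passed in the text column from the dataframe.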
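To make the point about iterable sources concrete, here is a small sketch. It assumes the vocabulary is built by feeding each element of the source to collections.Counter.update; that is an assumption about the internals, not something confirmed in this thread. count_tokens is a hypothetical helper and the two reviews are made-up sample data.

```python
from collections import Counter


def count_tokens(source):
    """Hypothetical helper: count the items that each element of `source` yields."""
    counter = Counter()
    for element in source:
        counter.update(element)
    return counter


# Stands in for the text column of the dataframe.
reviews = ["good film", "bad film"]

# Raw strings are iterated character by character.
char_counts = count_tokens(reviews)

# Tokenized rows give word counts instead.
word_counts = count_tokens(text.split() for text in reviews)

print(char_counts["film"], word_counts["film"])
```

If that assumption holds, a raw text column would be counted per character, so the column would need to be tokenized first (or wrapped in a Dataset) before being passed in.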

Dataframe as datasource in torchtext
