Dataset Types
====================================================================
.. note::

    Refer to ``./examples/datasets/`` for examples of pre-processing
    common dataset formats to conform to Rafiki's own dataset formats.

.. _`dataset-type:IMAGE_FILES`:
IMAGE_FILES
--------------------------------------------------------------------
The dataset file must be a ``.zip`` archive with an ``images.csv`` file at the root of the archive.
``images.csv`` should be in CSV format,
with a ``path`` column and ``N`` other columns with variable names (*tag columns*).
For each row,
``path`` should be a file path to a ``.png``, ``.jpg`` or ``.jpeg`` image file within the archive,
relative to the root of the directory.
The other ``N`` columns describe the corresponding image, *depending on the task*.
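
As a rough illustration of the layout described above, the following sketch packs images and an ``images.csv`` into a single archive using only the Python standard library. The helper name ``build_image_files_zip`` and the ``class`` tag column are hypothetical; the actual tag columns depend on the task.

```python
import csv
import io
import zipfile


def build_image_files_zip(zip_target, images, tag_columns):
    """Write an IMAGE_FILES-style archive (hypothetical helper, not part of Rafiki).

    zip_target:  a file path or a writable binary file object.
    images:      list of (path_in_archive, image_bytes, tags) tuples, where
                 tags is a dict keyed by the names in tag_columns.
    tag_columns: names of the N tag columns (task-dependent, assumed here).
    """
    allowed = (".png", ".jpg", ".jpeg")
    csv_buffer = io.StringIO()
    writer = csv.writer(csv_buffer, lineterminator="\n")
    writer.writerow(["path", *tag_columns])

    with zipfile.ZipFile(zip_target, "w") as archive:
        for path, data, tags in images:
            if not path.lower().endswith(allowed):
                raise ValueError(f"unsupported image extension: {path}")
            # Image paths are stored relative to the root of the archive.
            archive.writestr(path, data)
            writer.writerow([path, *(tags[name] for name in tag_columns)])
        # images.csv must sit at the root of the archive.
        archive.writestr("images.csv", csv_buffer.getvalue())
```

Calling ``build_image_files_zip("train.zip", [("images/0.png", png_bytes, {"class": 3})], ["class"])`` would produce an archive containing ``images/0.png`` and a two-line ``images.csv``, where ``png_bytes`` stands for the raw bytes of an image you supply.
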
.. _`dataset-type:CORPUS`:
CORPUS
--------------------------------------------------------------------
The dataset file must be a ``.zip`` archive with a ``corpus.tsv`` file at the root of the archive.
``corpus.tsv`` should be in TSV format,
with a ``token`` column and ``N`` other columns with variable names (*tag columns*).
For each row,
``token`` should be a string, a token (e.g. word) in the corpus.
Tokens should appear in the same order as they do in the text of the corpus.
To delimit sentences, ``token`` can take the value of ``\n``.
The other ``N`` columns describe the corresponding token as part of the text of the corpus, *depending on the task*.
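
The sketch below shows one way such a file could be produced with the Python standard library. It rests on two assumptions that the text above does not settle: the sentence delimiter is written as the two characters ``\`` and ``n`` rather than a raw newline, and tag columns on delimiter rows are left empty. The helper name ``build_corpus_zip`` and the ``tag`` column are hypothetical.

```python
import csv
import io
import zipfile

# Assumption: the delimiter is a backslash followed by "n", not a raw newline.
SENTENCE_DELIMITER = "\\n"


def build_corpus_zip(zip_target, sentences, tag_columns):
    """Write a CORPUS-style archive (hypothetical helper, not part of Rafiki).

    zip_target:  a file path or a writable binary file object.
    sentences:   list of sentences; each sentence is a list of
                 (token, tags) pairs, where tags is a dict keyed by tag_columns.
    tag_columns: names of the N tag columns (task-dependent, assumed here).
    """
    tsv_buffer = io.StringIO()
    writer = csv.writer(tsv_buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["token", *tag_columns])

    for sentence in sentences:
        # Tokens are written in the order they appear in the text.
        for token, tags in sentence:
            writer.writerow([token, *(tags[name] for name in tag_columns)])
        # Assumption: tag values are left empty on sentence-delimiter rows.
        writer.writerow([SENTENCE_DELIMITER, *("" for _ in tag_columns)])

    with zipfile.ZipFile(zip_target, "w") as archive:
        # corpus.tsv must sit at the root of the archive.
        archive.writestr("corpus.tsv", tsv_buffer.getvalue())
```

For a part-of-speech tagging task, for example, each sentence could be passed as ``[("Hello", {"tag": "UH"}), ("world", {"tag": "NN"})]`` with ``tag_columns=["tag"]``.
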
.. _`dataset-type:TABULAR`:
TABULAR
--------------------------------------------------------------------
The dataset file must be a tabular dataset in ``.csv`` format with ``N`` columns.
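
A quick sanity check before uploading can catch ragged rows. The sketch below assumes the first row is a header naming the ``N`` columns, which the description above does not state; ``count_columns`` is a hypothetical helper.

```python
import csv


def count_columns(csv_file):
    """Return N, the number of columns, after checking every row has N fields.

    csv_file: an open text file object (or any iterable of lines).
    Assumption: the first row is a header row.
    """
    reader = csv.reader(csv_file)
    header = next(reader)
    for line_number, row in enumerate(reader, start=2):
        if len(row) != len(header):
            raise ValueError(
                f"row {line_number} has {len(row)} fields, expected {len(header)}"
            )
    return len(header)
```

When reading from disk, open the file with ``newline=""`` as the ``csv`` module documentation recommends, for example ``with open("data.csv", newline="") as f: count_columns(f)``.
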
.. _`dataset-type:AUDIO_FILES`:
AUDIO_FILES
--------------------------------------------------------------------
The dataset file must be a ``.zip`` archive with an ``audios.csv`` file at the root of the archive.