Dataset Types
Note
Refer to ./examples/datasets/ for examples of pre-processing common dataset formats to conform to Rafiki's own dataset formats.
IMAGE_FILES
The dataset file must be a .zip archive with an images.csv at the root of the directory.
The images.csv should be in CSV format, with a path column and N other columns with variable names (tag columns).
For each row, path should be a file path to a .png, .jpg or .jpeg image file within the archive, relative to the root of the directory. The other N columns describe the corresponding image, depending on the task.
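The IMAGE_FILES layout can be sketched with the Python standard library alone. This is a minimal illustration, not Rafiki's own tooling: the function name build_image_files_dataset, the tag column name class, and the image paths are hypothetical, and the image bytes are placeholders rather than real image data.

```python
import csv
import io
import zipfile


def build_image_files_dataset(images, tag_columns):
    """Return the bytes of a .zip archive in the IMAGE_FILES layout.

    images: list of (path_in_archive, image_bytes, tags) tuples, where
            tags is a dict keyed by the names in tag_columns.
    tag_columns: names of the N tag columns (hypothetical, task-dependent).
    """
    csv_buffer = io.StringIO()
    writer = csv.writer(csv_buffer, lineterminator="\n")
    # Header row: the path column followed by the N tag columns.
    writer.writerow(["path"] + list(tag_columns))

    archive_buffer = io.BytesIO()
    with zipfile.ZipFile(archive_buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for path, image_bytes, tags in images:
            if not path.lower().endswith((".png", ".jpg", ".jpeg")):
                raise ValueError(f"unsupported image extension: {path}")
            # Store the image at the path (relative to the archive root)
            # that the matching images.csv row refers to.
            archive.writestr(path, image_bytes)
            writer.writerow([path] + [tags[name] for name in tag_columns])
        # images.csv sits at the root of the archive.
        archive.writestr("images.csv", csv_buffer.getvalue())
    return archive_buffer.getvalue()
```

For example, build_image_files_dataset([("images/0.png", png_bytes, {"class": 0})], ["class"]) returns an archive containing images/0.png and an images.csv with the columns path and class.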
CORPUS
The dataset file must be a .zip archive with a corpus.tsv at the root of the directory.
The corpus.tsv should be in TSV format, with a token column and N other columns with variable names (tag columns).
For each row, token should be a string, a token (e.g. word) in the corpus. Tokens should appear in the same order as in the text of the corpus. To delimit sentences, token can take the value of \n. The other N columns describe the corresponding token as part of the text of the corpus, depending on the task.
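A CORPUS archive can be sketched in the same way. This example rests on two assumptions that the text above does not settle: the sentence delimiter is written as the literal two-character string \n (a backslash followed by n) rather than a real newline character, and the tag columns are left empty on delimiter rows. The function name build_corpus_dataset and the tag column name tag are hypothetical.

```python
import csv
import io
import zipfile


def build_corpus_dataset(sentences, tag_columns):
    """Return the bytes of a .zip archive in the CORPUS layout.

    sentences: list of sentences; each sentence is a list of
               (token, tags) pairs, where tags holds N values.
    tag_columns: names of the N tag columns (hypothetical, task-dependent).
    """
    tsv_buffer = io.StringIO()
    writer = csv.writer(tsv_buffer, delimiter="\t", lineterminator="\n")
    # Header row: the token column followed by the N tag columns.
    writer.writerow(["token"] + list(tag_columns))
    for sentence in sentences:
        # Tokens are written in the order they appear in the text.
        for token, tags in sentence:
            writer.writerow([token] + list(tags))
        # Assumption: a row whose token is the literal string "\n"
        # (backslash + n) closes the sentence; its tags are left empty.
        writer.writerow(["\\n"] + [""] * len(tag_columns))

    archive_buffer = io.BytesIO()
    with zipfile.ZipFile(archive_buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # corpus.tsv sits at the root of the archive.
        archive.writestr("corpus.tsv", tsv_buffer.getvalue())
    return archive_buffer.getvalue()
```

If the consumer of the dataset expects a different delimiter encoding or specific tag values on delimiter rows, only the single delimiter writerow call needs to change.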
TABULAR
The dataset file must be a tabular dataset in the .csv format with N columns.
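A quick sanity check for a TABULAR file is to confirm that every row has the same number of columns. The helper below is a hypothetical sketch, not part of Rafiki; it treats the first row like any other row, since the text above does not say whether a header row is expected.

```python
import csv
import io


def count_tabular_columns(csv_text):
    """Return N if every row of csv_text has N columns.

    Raises ValueError if the text has no rows or if the rows
    disagree on the number of columns.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("dataset is empty")
    n_columns = len(rows[0])
    for row_number, row in enumerate(rows[1:], start=2):
        if len(row) != n_columns:
            raise ValueError(
                f"row {row_number} has {len(row)} columns, expected {n_columns}"
            )
    return n_columns
```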
AUDIO_FILES
The dataset file must be a .zip archive with an audios.csv at the root of the directory.