Dataset Types
Note
Refer to ./examples/datasets/ for examples of pre-processing common dataset formats to conform to Rafiki’s own dataset formats.
IMAGE_FILES
The dataset file must be a .zip archive with an images.csv file at the root of the archive.
The images.csv file should be in CSV format, with a path column and N other columns with variable names (tag columns).
For each row, path should be a file path to a .png, .jpg or .jpeg image file within the archive, relative to the root of the archive.
The other N columns describe the corresponding image, depending on the task.
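As a rough illustration of the layout described above, the following Python sketch writes a .zip archive with image files and an images.csv at its root. It uses only the standard library and is not part of Rafiki's API: the function name write_image_files_dataset, the tag column class and the placeholder image bytes are all assumptions made for the example.

```python
import csv
import io
import zipfile


def write_image_files_dataset(out, images, tag_columns):
    """Write an IMAGE_FILES-style .zip archive (hypothetical helper).

    out: a file path or a writable binary file object
    images: list of (path_in_archive, image_bytes, tags), where tags is a dict
    tag_columns: names of the N tag columns (these depend on the task)
    """
    allowed = (".png", ".jpg", ".jpeg")
    csv_buffer = io.StringIO()
    writer = csv.writer(csv_buffer, lineterminator="\n")
    # Header row: the path column followed by the tag columns.
    writer.writerow(["path"] + list(tag_columns))

    with zipfile.ZipFile(out, "w") as archive:
        for path, data, tags in images:
            if not path.lower().endswith(allowed):
                raise ValueError(f"unsupported image extension: {path}")
            # Store the image at a path relative to the archive root.
            archive.writestr(path, data)
            writer.writerow([path] + [tags[col] for col in tag_columns])
        # images.csv must sit at the root of the archive.
        archive.writestr("images.csv", csv_buffer.getvalue())


# Example call with placeholder bytes (not real image data).
buffer = io.BytesIO()
write_image_files_dataset(
    buffer,
    [("images/0.png", b"placeholder", {"class": 0}),
     ("images/1.jpg", b"placeholder", {"class": 1})],
    tag_columns=["class"],
)
```

Passing a file path instead of the in-memory buffer writes the archive to disk. Which tag columns are expected depends on the task the dataset is used for.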
CORPUS
The dataset file must be a .zip archive with a corpus.tsv file at the root of the archive.
The corpus.tsv file should be in TSV format, with a token column and N other columns with variable names (tag columns).
For each row, token should be a string, a token (e.g. a word) in the corpus. The tokens should appear in the same order as in the text of the corpus. To delimit sentences, token can take the value \n.
The other N columns describe the corresponding token as part of the text of the corpus, depending on the task.
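To make the sentence-delimiting convention concrete, here is a minimal Python sketch that reads corpus.tsv from such an archive and groups the tokens into sentences. It is not Rafiki's own loader. The function name read_corpus_sentences is hypothetical, and the sketch assumes the delimiter is the literal two-character string \n (a backslash followed by n) rather than a raw newline character, and that the file is UTF-8 encoded; check both against the examples in ./examples/datasets/.

```python
import csv
import io
import zipfile

# Assumption: the delimiter token is a literal backslash followed by "n".
SENTENCE_DELIMITER = "\\n"


def read_corpus_sentences(source):
    """Return a list of sentences, each a list of (token, tags) pairs.

    source: a file path or a readable binary file object for the .zip archive
    """
    sentences, current = [], []
    with zipfile.ZipFile(source) as archive:
        with archive.open("corpus.tsv") as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.DictReader(text, delimiter="\t")
            for row in reader:
                token = row.pop("token")  # the remaining keys are tag columns
                if token == SENTENCE_DELIMITER:
                    # A delimiter row closes the current sentence.
                    if current:
                        sentences.append(current)
                    current = []
                else:
                    current.append((token, dict(row)))
    # Keep a trailing sentence that has no closing delimiter row.
    if current:
        sentences.append(current)
    return sentences
```

Tag values on delimiter rows are ignored here, since the format description does not say what they should contain.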
TABULAR
The dataset file must be a tabular dataset in .csv format with N columns.
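One simple sanity check for a file like this is that every row has the same number of fields. The sketch below is only an illustration using Python's standard csv module; the helper name count_columns is hypothetical and is not something Rafiki provides.

```python
import csv
import io


def count_columns(csv_text):
    """Return N if every non-empty row has N fields; raise ValueError otherwise."""
    reader = csv.reader(io.StringIO(csv_text))
    widths = {len(row) for row in reader if row}
    if len(widths) != 1:
        raise ValueError(f"inconsistent column counts: {sorted(widths)}")
    return widths.pop()
```

For example, count_columns("a,b,c\n1,2,3\n") returns 3, while a file whose rows differ in length raises ValueError.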
AUDIO_FILES
The dataset file must be a .zip archive with an audios.csv file at the root of the archive.