Supported Dataset Types
Dataset URIs must use either the http or the https protocol.
Note
Alternatively, you can use a relative file path (e.g. data/dataset.zip) as a dataset URI,
but only if you have deployed the full Rafiki stack on your own machine. The file path is relative to
the root of the project directory.
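The URI rules above can be sketched as a small check. This is only an illustration: the function name is_supported_dataset_uri and the local_deployment flag are hypothetical and not part of Rafiki, and the check assumes that a relative file path is any non-empty path with no scheme, no host, and no leading slash.

```python
from urllib.parse import urlparse


def is_supported_dataset_uri(uri, local_deployment=False):
    """Hypothetical helper: check a dataset URI against the rules described above."""
    parsed = urlparse(uri)
    # Remote datasets: only http and https are accepted, and a host must be present.
    if parsed.scheme in ("http", "https"):
        return bool(parsed.netloc)
    # No scheme and no host: treat the value as a file path (assumption).
    if parsed.scheme == "" and parsed.netloc == "":
        is_relative = bool(parsed.path) and not parsed.path.startswith("/")
        # Relative paths are only usable with a full local deployment.
        return local_deployment and is_relative
    return False
```

For example, is_supported_dataset_uri("data/dataset.zip") is rejected unless local_deployment=True is passed.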
Note
Refer to ./examples/datasets/ for examples of pre-processing common dataset formats to conform to Rafiki’s own dataset formats.
IMAGE_FILES
The dataset file must be a .zip archive with an images.csv file at the root of the archive.
images.csv should be in CSV format with 2 columns, path and class.
For each row,
path should be a file path to a .png, .jpg or .jpeg image file within the archive, relative to the root of the directory.
class should be an integer from 0 to k - 1, where k is the number of classes in the classification of images.
An example of images.csv follows:
path,class
image-0-of-class-0.png,0
image-1-of-class-0.png,0
...
image-0-of-class-1.png,1
...
image-99-of-class-9.png,9
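A minimal sketch of producing and checking such an archive with the Python standard library follows. The helper names write_image_files_zip and check_image_files_zip are hypothetical and not part of Rafiki. The checks cover only what is described above; matching file extensions case-insensitively is an assumption.

```python
import csv
import io
import zipfile

ALLOWED_EXTENSIONS = (".png", ".jpg", ".jpeg")


def write_image_files_zip(zip_path, images):
    """Hypothetical helper. `images` is a list of (archive_path, image_bytes, class_index)."""
    csv_buffer = io.StringIO()
    writer = csv.writer(csv_buffer, lineterminator="\n")
    writer.writerow(["path", "class"])
    with zipfile.ZipFile(zip_path, "w") as archive:
        for archive_path, image_bytes, class_index in images:
            archive.writestr(archive_path, image_bytes)
            writer.writerow([archive_path, class_index])
        # images.csv goes at the root of the archive.
        archive.writestr("images.csv", csv_buffer.getvalue())


def check_image_files_zip(zip_path):
    """Hypothetical helper. Returns a list of problems; an empty list means none were found."""
    with zipfile.ZipFile(zip_path) as archive:
        names = set(archive.namelist())
        if "images.csv" not in names:
            return ["images.csv not found at the archive root"]
        text = archive.read("images.csv").decode("utf-8")

    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != ["path", "class"]:
        return ["expected header 'path,class', got %r" % (reader.fieldnames,)]

    problems = []
    for line_number, row in enumerate(reader, start=2):
        path = row["path"] or ""
        label = row["class"] or ""
        # Assumption: extensions are compared case-insensitively.
        if not path.lower().endswith(ALLOWED_EXTENSIONS):
            problems.append(f"line {line_number}: {path!r} is not a .png, .jpg or .jpeg file")
        if path not in names:
            problems.append(f"line {line_number}: {path!r} is not in the archive")
        # class must be a non-negative integer (0 to k - 1).
        if not label.isdigit():
            problems.append(f"line {line_number}: class {label!r} is not a non-negative integer")
    return problems
```

Both helpers accept either a file path or a file-like object, since zipfile.ZipFile supports both.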
CORPUS
The dataset file must be a .zip archive with a corpus.tsv file at the root of the archive.
corpus.tsv should be in TSV format with a token column and N other columns with variable names (tag columns).
For each row,
token should be a string, a token (e.g. word) in the corpus. These tokens should appear in the same order as in the text of the corpus. To delimit sentences, token can take the value of \n.
The other N columns should be integers from 0 to k_i - 1, where k_i is the number of classes for column i. These tag columns describe the corresponding token as part of the text of the corpus, and depend on the task.
An example of corpus.tsv for POS tagging follows:
token       tag
Two         3
leading     2
...
line-item   1
veto        5
.           4
\n          0
Professors  6
Philip      6
...
previous    1
presidents  8
.           4
\n          0
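As a rough sketch, a corpus.tsv like the one above could be written and read back with the Python standard library. The helper names are hypothetical and not part of Rafiki. Two assumptions are taken from the example rather than from a stated rule: the sentence delimiter is the literal two-character text \n (a backslash followed by n), and each tag column on a delimiter row is set to 0.

```python
import csv
import io
import zipfile

# Assumption: the delimiter token is a literal backslash followed by "n", as in the example.
SENTENCE_DELIMITER = "\\n"


def write_corpus_zip(zip_path, sentences, tag_columns=("tag",)):
    """Hypothetical helper. `sentences` is a list of sentences; each sentence is a
    list of (token, tag_1, ..., tag_N) tuples with integer tags."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["token", *tag_columns])
    for sentence in sentences:
        for token, *tags in sentence:
            writer.writerow([token, *tags])
        # Close the sentence with a delimiter row; tag value 0 mirrors the example (assumption).
        writer.writerow([SENTENCE_DELIMITER] + [0] * len(tag_columns))
    with zipfile.ZipFile(zip_path, "w") as archive:
        # corpus.tsv goes at the root of the archive.
        archive.writestr("corpus.tsv", buffer.getvalue())


def read_corpus_sentences(zip_path):
    """Hypothetical helper. Returns (header, sentences), splitting on delimiter rows."""
    with zipfile.ZipFile(zip_path) as archive:
        text = archive.read("corpus.tsv").decode("utf-8")
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    sentences, current = [], []
    for token, *tags in reader:
        if token == SENTENCE_DELIMITER:
            sentences.append(current)
            current = []
        else:
            current.append((token, *(int(tag) for tag in tags)))
    if current:  # keep a trailing sentence that has no delimiter row
        sentences.append(current)
    return header, sentences
```

Tokens are kept in corpus order, so reading the file back yields the sentences in the order they were written.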