Supported Dataset Types

Dataset URIs must use either the http or https protocol.

Note

Alternatively, you can use relative file paths (e.g. data/dataset.zip) as dataset URIs, but only if you have deployed the full Rafiki stack on your own machine. Such a file path is relative to the root of the project directory.
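The URI rules above can be sketched as a small check. This is only an illustration: classify_dataset_uri is a hypothetical helper, not part of Rafiki, and it assumes that anything without a scheme and without a leading slash counts as a relative file path.

```python
from urllib.parse import urlparse


def classify_dataset_uri(uri):
    """Return 'remote' for http/https URIs and 'relative' for relative file paths.

    Hypothetical helper for illustration; Rafiki itself may validate URIs differently.
    """
    if not uri:
        raise ValueError('dataset URI must not be empty')
    scheme = urlparse(uri).scheme.lower()
    if scheme in ('http', 'https'):
        return 'remote'
    # Assumption: no scheme and no leading slash means a path relative to the project root
    if scheme == '' and not uri.startswith('/'):
        return 'relative'
    raise ValueError('unsupported dataset URI: ' + uri)
```

A relative result is only usable when the full Rafiki stack runs on your own machine, as noted above.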

Note

Refer to ./examples/datasets/ for examples of pre-processing common dataset formats to conform to Rafiki’s own dataset formats.

IMAGE_FILES

The dataset file must be a .zip archive with an images.csv at the root of the archive.

The images.csv file should be in CSV format with 2 columns: path and class.

For each row,

path should be a file path to a .png, .jpg or .jpeg image file within the archive, relative to the root of the directory.

class should be an integer from 0 to k - 1, where k is the number of image classes.

An example of images.csv follows:

path,class
image-0-of-class-0.png,0
image-1-of-class-0.png,0
...
image-0-of-class-1.png,1
...
image-99-of-class-9.png,9
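As a rough sketch of how such an archive could be assembled with the Python standard library: build_image_files_dataset and its arguments are hypothetical names chosen for illustration, and the choice to store every image flat at the archive root is an assumption, not a Rafiki requirement.

```python
import csv
import io
import os
import zipfile

ALLOWED_EXTENSIONS = {'.png', '.jpg', '.jpeg'}


def build_image_files_dataset(image_paths_and_classes, out_zip_path):
    """Write a .zip archive holding the images plus an images.csv at its root.

    image_paths_and_classes: list of (local_image_path, class_index) pairs.
    Hypothetical helper; images are stored flat at the archive root.
    """
    csv_buffer = io.StringIO()
    writer = csv.writer(csv_buffer, lineterminator='\n')
    writer.writerow(['path', 'class'])
    seen_names = set()

    with zipfile.ZipFile(out_zip_path, 'w') as zf:
        for local_path, class_index in image_paths_and_classes:
            extension = os.path.splitext(local_path)[1].lower()
            if extension not in ALLOWED_EXTENSIONS:
                raise ValueError('unsupported image type: ' + local_path)
            if not isinstance(class_index, int) or class_index < 0:
                raise ValueError('class must be a non-negative integer')
            archive_name = os.path.basename(local_path)
            if archive_name in seen_names:
                raise ValueError('duplicate file name: ' + archive_name)
            seen_names.add(archive_name)
            # Store the image at the archive root and record its row
            zf.write(local_path, arcname=archive_name)
            writer.writerow([archive_name, class_index])
        # images.csv must sit at the root of the archive
        zf.writestr('images.csv', csv_buffer.getvalue())
```

The helper does not check that every class from 0 to k - 1 actually occurs; add that check if your data source may skip a class.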

CORPUS

The dataset file must be a .zip archive with a corpus.tsv at the root of the archive.

The corpus.tsv file should be in TSV format with a token column and N other columns with variable names (tag columns).

For each row,

token should be a string, a token (e.g. a word) in the corpus. Tokens should appear in the same order as in the text of the corpus. To delimit sentences, token can take the value of \n.

The other N columns should be integers from 0 to k_i - 1, where k_i is the number of classes for the i-th tag column. These tag columns describe the corresponding token as part of the text of the corpus, and depend on the task.

An example of corpus.tsv for POS tagging follows:

token       tag
Two         3
leading     2
...
line-item   1
veto        5
.           4
\n          0
Professors  6
Philip      6
...
previous    1
presidents  8
.           4
\n          0
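A minimal sketch of writing such a file from tokenized sentences follows. build_corpus_dataset is a hypothetical helper, not part of Rafiki. It assumes that the sentence delimiter is the two-character sequence of a backslash followed by n, as displayed in the example, and that delimiter rows carry 0 in every tag column, again mirroring the example.

```python
import zipfile

# Assumption: the delimiter token is a literal backslash followed by 'n'
SENTENCE_DELIMITER = '\\n'


def build_corpus_dataset(sentences, tag_names, out_zip_path):
    """Write a .zip archive with a corpus.tsv at its root.

    sentences: list of sentences, each a list of (token, tag_1, ..., tag_N) tuples.
    tag_names: the N tag column names, e.g. ['tag'].
    Hypothetical helper for illustration.
    """
    lines = ['\t'.join(['token'] + list(tag_names))]
    for sentence in sentences:
        for token, *tags in sentence:
            if len(tags) != len(tag_names):
                raise ValueError('expected %d tags for token %r' % (len(tag_names), token))
            if '\t' in token or '\n' in token:
                raise ValueError('token must not contain tabs or line breaks')
            lines.append('\t'.join([token] + [str(tag) for tag in tags]))
        # Close each sentence with a delimiter row (tag value 0 is an assumption)
        lines.append('\t'.join([SENTENCE_DELIMITER] + ['0'] * len(tag_names)))

    with zipfile.ZipFile(out_zip_path, 'w') as zf:
        zf.writestr('corpus.tsv', '\n'.join(lines) + '\n')
```

Rows are joined by hand rather than through the csv module so that tokens such as quotation marks are written unchanged.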