Preparing the Dataset with Builder Script or TFDS Module
---------------------------------------------------------

1.  For this example, we will be using the 10 Monkey Species dataset. The directory structure of this dataset is as follows:

    ```
    monkey_species_dataset
    |-- __init__.py
    |-- monkey_labels.txt
    |-- training
    |   `-- training
    |       |-- n0
    |       |-- n1
    |       |-- n2
    |       |-- n3
    |       |-- n4
    |       |-- n5
    |       |-- n6
    |       |-- n7
    |       |-- n8
    |       `-- n9
    |           |-- n9151.jpg
    |           `-- n9160.png
    `-- validation
        `-- validation
            |-- n0
            |-- n1
            |-- n2
            |-- n3
            |-- n4
            |-- n5
            |-- n6
            |-- n7
            |-- n8
            `-- n9
    ```
2.  Next, let us set up `wandb-addons` together with its optional dependencies, which include `tensorflow` and `tfds-nightly`, the nightly release of `tensorflow-datasets`. We can do this with a single `pip install` command, sketched below.
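    A minimal sketch of the install command, assuming `wandb-addons` is installed from its GitHub repository with a `dataset` extra that pulls in the optional dependencies (check the project README for the canonical command):

    ```shell
    # Assumed install command: the `dataset` extra is taken to provide
    # tensorflow and tfds-nightly alongside wandb-addons itself.
    pip install "wandb-addons[dataset] @ git+https://github.com/soumik12345/wandb-addons"
    ```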
3.  Now, let us `cd` into the `monkey_species_dataset` directory and initialize the TensorFlow Datasets template files, which will be used for interpreting and registering features from our dataset:

    ```shell
    cd monkey_species_dataset

    # Create `monkey_species_dataset/monkey_species` template files.
    tfds new monkey_species
    ```
    This would create a directory with the following structure inside `monkey_species_dataset`:

    ```
    monkey_species
    |-- CITATIONS.bib
    |-- README.md
    |-- TAGS.txt
    |-- __init__.py
    |-- checksums.tsv
    |-- dummy_data
    |   `-- TODO-add_fake_data_in_this_directory.txt
    |-- monkey_species_dataset_builder.py
    `-- monkey_species_dataset_builder_test.py
    ```
    The complete directory structure of `monkey_species_dataset` at this point is going to look something like:

    ```
    monkey_species_dataset
    |-- __init__.py
    |-- monkey_labels.txt
    |-- monkey_species
    |   |-- CITATIONS.bib
    |   |-- README.md
    |   |-- TAGS.txt
    |   |-- __init__.py
    |   |-- checksums.tsv
    |   |-- dummy_data
    |   |   `-- TODO-add_fake_data_in_this_directory.txt
    |   |-- monkey_species_dataset_builder.py
    |   `-- monkey_species_dataset_builder_test.py
    |-- training
    |   `-- training
    |       |-- n0
    |       |-- n1
    |       |-- n2
    |       |-- n3
    |       |-- n4
    |       |-- n5
    |       |-- n6
    |       |-- n7
    |       |-- n8
    |       `-- n9
    |           |-- n9151.jpg
    |           `-- n9160.png
    `-- validation
        `-- validation
            |-- n0
            |-- n1
            |-- n2
            |-- n3
            |-- n4
            |-- n5
            |-- n6
            |-- n7
            |-- n8
            `-- n9
    ```
    **Note:** The name with which you initialize the `tfds new` command is used as the `name` of your dataset.
4.  Now we will write our dataset builder in the file `monkey_species_dataset/monkey_species/monkey_species_dataset_builder.py`. The logic for writing a dataset builder here is much the same as writing one for HuggingFace Datasets or a vanilla TensorFlow dataset; a sketch of what it might look like is shown after the note below.

    **Note:** As an alternative to step 3, you could also simply include a builder script `<dataset_name>.py` in the `monkey_species_dataset` directory, instead of creating the TFDS module. You can refer to the TensorFlow Datasets documentation for examples and guides on writing builder scripts.
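    The following is a minimal sketch of such a builder. The feature spec, label names, and relative paths are assumptions based on the directory structure above, not the exact code used for this example:

    ```python
    """monkey_species dataset builder (illustrative sketch)."""

    import os

    import tensorflow_datasets as tfds


    class Builder(tfds.core.GeneratorBasedBuilder):
        """DatasetBuilder for the 10 Monkey Species dataset."""

        VERSION = tfds.core.Version("1.0.0")
        RELEASE_NOTES = {"1.0.0": "Initial release."}

        def _info(self) -> tfds.core.DatasetInfo:
            # Declare the features so TFDS knows how to encode/decode examples.
            return self.dataset_info_from_configs(
                features=tfds.features.FeaturesDict({
                    "image": tfds.features.Image(shape=(None, None, 3)),
                    "label": tfds.features.ClassLabel(
                        names=[f"n{i}" for i in range(10)]
                    ),
                }),
                supervised_keys=("image", "label"),
            )

        def _split_generators(self, dl_manager: tfds.download.DownloadManager):
            # The images already live alongside this module, so there is
            # nothing to download; we just point at the local directories.
            # These relative paths are assumptions based on the tree above.
            return {
                "train": self._generate_examples("../training/training"),
                "validation": self._generate_examples("../validation/validation"),
            }

        def _generate_examples(self, path):
            # Yield (unique_key, example_dict) pairs, one per image file.
            for label in sorted(os.listdir(path)):
                label_dir = os.path.join(path, label)
                for fname in sorted(os.listdir(label_dir)):
                    yield f"{label}/{fname}", {
                        "image": os.path.join(label_dir, fname),
                        "label": label,
                    }
    ```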
5.  Now that our dataset is ready with the specifications for loading the features, we can upload it to our Weights & Biases project as an artifact using the `upload_dataset` function, which verifies that the dataset builds successfully before uploading it:

    ```python
    import wandb
    from wandb_addons.dataset import upload_dataset

    # Initialize a W&B Run
    wandb.init(project="my-awesome-project", job_type="upload_dataset")

    # Note that we should set our dataset name as the name of the artifact
    upload_dataset(name="my_awesome_dataset", path="./my/dataset/path", type="dataset")
    ```
    In order to load this dataset in your ML workflow, you can simply use the `load_dataset` function:

    ```python
    from wandb_addons.dataset import load_dataset

    datasets, dataset_builder_info = load_dataset("entity-name/my-awesome-project/my_awesome_dataset:v0")
    ```
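    Once loaded, the splits can be consumed like ordinary TensorFlow datasets. A minimal sketch, assuming `datasets` is a dictionary mapping split names (e.g. `"train"`, `"validation"`) to `tf.data.Dataset` objects whose elements are feature dictionaries (check the `wandb-addons` documentation for the exact return types):

    ```python
    # Hypothetical usage: the split names and element structure are assumptions.
    train_ds = datasets["train"].shuffle(1024).batch(32)

    for batch in train_ds.take(1):
        # Each batch is assumed to be a dict of features,
        # e.g. {"image": ..., "label": ...}.
        print({key: value.shape for key, value in batch.items()})
    ```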
    **Note:**

    - The `upload_dataset` function by default converts a registered dataset to TFRecords (like this artifact). You can alternatively upload the dataset in its original state, along with the added TFDS module containing the builder script, by simply setting the `upload_tfrecords` parameter to `False` (see the example after this list).
    - This won't affect loading the dataset using `load_dataset`; loading from artifacts works as long as the artifact contains either the TFRecords or the original dataset with the TFDS module.
    - The TFRecord artifact has to follow the specification described in this guide. However, if you're using the `upload_dataset` function, you don't need to worry about this.
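    For example, to upload the original files plus the TFDS module instead of TFRecords (the `upload_tfrecords` parameter is named in the note above; the remaining arguments mirror the earlier upload snippet):

    ```python
    from wandb_addons.dataset import upload_dataset

    # Upload the dataset in its original state along with the TFDS module,
    # skipping the TFRecord conversion.
    upload_dataset(
        name="my_awesome_dataset",
        path="./my/dataset/path",
        type="dataset",
        upload_tfrecords=False,
    )
    ```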
    You can take a look at this artifact, which demonstrates the aforementioned directory structure and builder logic.