🔥 Data Loading with WandB Artifacts 🪄🐝¶
This notebook demonstrates the usage of a simple and easy-to-use data loading API built on top of TensorFlow Datasets and WandB Artifacts.
Loading the Dataset¶
Now that the dataset has been uploaded as an artifact along with the builder logic, loading and ingesting it is incredibly easy. To do this, we simply use the wandb_addons.dataset.load_dataset function.
import tensorflow as tf
import matplotlib.pyplot as plt
from wandb_addons.dataset import load_dataset
2023-04-30 12:34:52.474435: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Just pass in the artifact address of the dataset artifact, and we are all set. For detailed documentation of all parameters and options, refer to this page.
Note: For loading and ingesting a dataset from a WandB artifact, it is not compulsory to initialize a run. However, loading inside the context of a run has the added advantages of tracking the lineage of artifacts and easier versioning.
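If you do want that lineage tracking, here is a minimal sketch of loading the dataset inside a run context; the project and job type names are illustrative placeholders, not part of this notebook.

import wandb
from wandb_addons.dataset import load_dataset

# Initializing a run is optional, but doing so lets WandB record the dataset
# artifact as an input to the run and track its lineage.
# The project and job_type below are placeholder values for this sketch.
with wandb.init(project="artifact-accessor", job_type="data-loading") as run:
    datasets, dataset_builder_info = load_dataset(
        "geekyrakshit/artifact-accessor/monkey_dataset:v1", quiet=True
    )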
datasets, dataset_builder_info = load_dataset(
"geekyrakshit/artifact-accessor/monkey_dataset:v1", quiet=True
)
wandb: Downloading large artifact monkey_dataset:v1, 553.56MB. 8 files...
wandb: 8 of 8 files downloaded. Done. 0:0:0.9
wandb: Building dataset for split: train...
2023-04-30 12:34:56.083939: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
wandb: Built dataset for split: train, num_shards: 4, num_examples: 1096
wandb: Building dataset for split: val...
wandb: Built dataset for split: val, num_shards: 1, num_examples: 272
Now that we have created the TensorFlow datasets corresponding to the splits, along with the general info of the dataset, we can verify them.
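Before plotting anything, you can also take a quick look at the returned info object. The sketch below assumes it exposes the standard tfds.core.DatasetInfo attributes, which is what the features["label"].names access further down relies on.

# Inspect the dataset info returned by load_dataset (assumed to behave like
# a tfds.core.DatasetInfo object).
print(dataset_builder_info.features)  # feature spec, e.g. image and label
print(dataset_builder_info.splits)    # shard and example counts per split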
# Visualize a sample from the train split
class_names = dataset_builder_info.features["label"].names
sample = next(iter(datasets["train"]))
plt.imshow(sample["image"].numpy())
label_name = class_names[sample["label"].numpy()]
plt.title(f"Label: {label_name}")
plt.show()
sample = next(iter(datasets["val"]))
plt.imshow(sample["image"].numpy())
label_name = class_names[sample["label"].numpy()]
plt.title(f"Label: {label_name}")
plt.show()
print("Train Dataset Cardinality:", tf.data.experimental.cardinality(datasets["train"]).numpy())
print("Validation Dataset Cardinality:", tf.data.experimental.cardinality(datasets["val"]).numpy())
Train Dataset Cardinality: 1096
Validation Dataset Cardinality: 272
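Since each split is a regular tf.data.Dataset, it can be fed straight into an input pipeline. Here is a minimal sketch for the train split; the image size, batch size, and shuffle buffer are arbitrary values chosen for this example.

IMG_SIZE = 224
BATCH_SIZE = 32
AUTOTUNE = tf.data.AUTOTUNE

def preprocess(sample):
    # Resize the image and scale it to [0, 1]; keep the integer label as-is.
    image = tf.image.resize(sample["image"], (IMG_SIZE, IMG_SIZE))
    image = tf.cast(image, tf.float32) / 255.0
    return image, sample["label"]

train_pipeline = (
    datasets["train"]
    .shuffle(1024)
    .map(preprocess, num_parallel_calls=AUTOTUNE)
    .batch(BATCH_SIZE)
    .prefetch(AUTOTUNE)
)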
Now that we have verified the dataset splits, we can use them to build high-performance input pipelines for our training workflows, not only in TensorFlow but also in JAX and PyTorch. You can refer to the following docs on building input pipelines: