# Dataset Configuration Guide
This guide explains the structure and fields used in YAML configuration files for datasets within the Reward Kit. These configurations typically live in `conf/dataset/` or within an example's `conf/dataset/` directory (e.g., `examples/math_example/conf/dataset/`). They are processed by `reward_kit.datasets.loader.py` using Hydra.
There are two main types of dataset configurations: Base Datasets and Derived Datasets.
## 1. Base Dataset Configuration
A base dataset configuration defines the connection to a raw data source and performs initial processing like column mapping.
Example files: `conf/dataset/base_dataset.yaml` (schema), `examples/math_example/conf/dataset/gsm8k.yaml` (concrete example)
Key Fields:
- `_target_` (Required)
  - Description: Specifies the Python function to instantiate for loading this dataset.
  - Typical Value: `reward_kit.datasets.loader.load_and_process_dataset`
  - Example: `_target_: reward_kit.datasets.loader.load_and_process_dataset`
- `source_type` (Required)
  - Description: Defines the type of the data source.
  - Supported Values:
    - `"huggingface"`: For datasets hosted on the Hugging Face Hub.
    - `"jsonl"`: For local datasets in JSON Lines format.
    - `"fireworks"`: (Not yet implemented) For datasets hosted on Fireworks AI.
  - Example: `source_type: huggingface`
- `path_or_name` (Required)
  - Description: Identifier for the dataset.
    - For `huggingface`: the Hugging Face dataset name (e.g., `"gsm8k"`, `"cais/mmlu"`).
    - For `jsonl`: the path to the `.jsonl` file (e.g., `"data/my_data.jsonl"`).
  - Example: `path_or_name: "gsm8k"`
- `split` (Optional)
  - Description: Specifies the dataset split to load (e.g., `"train"`, `"test"`, `"validation"`). If a Hugging Face `DatasetDict` is loaded, or multiple JSONL files are mapped via `data_files`, this selects the split after loading.
  - Default: `"train"`
  - Example: `split: "test"`
- `config_name` (Optional)
  - Description: For Hugging Face datasets with multiple configurations (e.g., `"main"` or `"all"` for `gsm8k`). Corresponds to the `name` parameter of Hugging Face's `load_dataset`.
  - Default: `null`
  - Example: `config_name: "main"` (for `gsm8k`)
- `data_files` (Optional)
  - Description: Used for loading local files (such as JSONL or CSV) with Hugging Face's `datasets.load_dataset`. Can be a single file path, a list of paths, or a dictionary mapping split names to file paths.
  - Example: `data_files: {"train": "path/to/train.jsonl", "test": "path/to/test.jsonl"}`
- `max_samples` (Optional)
  - Description: Maximum number of samples to load from the dataset (or from each split if a `DatasetDict` is loaded). If `null` or `0`, all samples are loaded.
  - Default: `null`
  - Example: `max_samples: 100`
- `column_mapping` (Optional)
  - Description: A dictionary that renames columns from the source dataset to a standard internal format. Keys are the new standard names (e.g., `"query"`, `"ground_truth"`) and values are the original column names in the source dataset. This mapping is applied by `reward_kit.datasets.loader.py`.
  - Default: `{"query": "query", "ground_truth": "ground_truth", "solution": null}`
  - Example (`gsm8k.yaml`): see the illustrative configuration after this list.
- `preprocessing_steps` (Optional)
  - Description: A list of strings, where each string is a Python import path to a preprocessing function (e.g., `"reward_kit.datasets.loader.transform_codeparrot_apps_sample"`). These functions are applied to the dataset after loading and before column mapping.
  - Default: `[]`
  - Example: `preprocessing_steps: ["my_module.my_preprocessor_func"]`
- `hf_extra_load_params` (Optional)
  - Description: A dictionary of extra parameters passed directly to Hugging Face's `datasets.load_dataset()` (e.g., `trust_remote_code: True`).
  - Default: `{}`
  - Example: `hf_extra_load_params: {trust_remote_code: True}`
- `description` (Optional, Metadata)
  - Description: A brief description of the dataset configuration for documentation purposes.
  - Example: `description: "GSM8K (Grade School Math 8K) dataset."`
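Putting these fields together, a base dataset configuration in the spirit of `gsm8k.yaml` might look like the sketch below. The `column_mapping` values (`question`, `answer`) reflect GSM8K's published column names and are assumptions for illustration; the shipped config is not reproduced verbatim here.

```yaml
# Illustrative base dataset config; values not documented above are assumptions.
_target_: reward_kit.datasets.loader.load_and_process_dataset
source_type: huggingface
path_or_name: "gsm8k"
config_name: "main"
split: "train"
max_samples: null              # null or 0 loads all samples
column_mapping:
  query: "question"            # standard name -> original GSM8K column (assumed)
  ground_truth: "answer"
preprocessing_steps: []
hf_extra_load_params: {}
description: "GSM8K (Grade School Math 8K) dataset."
```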
## 2. Derived Dataset Configuration
A derived dataset configuration references a base dataset and applies further transformations, such as adding system prompts, changing the output format, or applying different column mappings or sample limits.
Example files: `examples/math_example/conf/dataset/base_derived_dataset.yaml` (schema), `examples/math_example/conf/dataset/gsm8k_math_prompts.yaml` (concrete example)
Key Fields:
- `_target_` (Required)
  - Description: Specifies the Python function to instantiate for loading this derived dataset.
  - Typical Value: `reward_kit.datasets.loader.load_derived_dataset`
  - Example: `_target_: reward_kit.datasets.loader.load_derived_dataset`
- `base_dataset` (Required)
  - Description: A reference to the base dataset configuration to derive from. This can be the name of another dataset configuration file (e.g., `"gsm8k"`, which loads `conf/dataset/gsm8k.yaml`) or a full inline base dataset configuration object.
  - Example: `base_dataset: "gsm8k"`
- `system_prompt` (Optional)
  - Description: A string used as the system prompt. In the `evaluation_format`, this prompt is added as a `system_prompt` field alongside `user_query`.
  - Default: `null`
  - Example (`gsm8k_math_prompts.yaml`): `"Solve the following math problem. Show your work clearly. Put the final numerical answer between <answer> and </answer> tags."`
- `output_format` (Optional)
  - Description: Specifies the final format for the derived dataset.
  - Supported Values:
    - `"evaluation_format"`: Converts dataset records to include `user_query`, `ground_truth_for_eval`, and optionally `system_prompt` and `id`. This is the standard format for many evaluation scenarios.
    - `"conversation_format"`: (Not yet implemented) Converts records to a list of messages.
    - `"jsonl"`: Keeps records in a format suitable for direct JSONL output (typically implies minimal transformation beyond base loading and initial mapping).
  - Default: `"evaluation_format"`
  - Example: `output_format: "evaluation_format"`
- `transformations` (Optional)
  - Description: A list of additional transformation functions applied after the base dataset is loaded and the initial derived processing (such as system prompt addition) is done. (Currently not fully implemented in `loader.py`.)
  - Default: `[]`
- `derived_column_mapping` (Optional)
  - Description: A dictionary for column mapping applied after the base dataset is loaded and before the `output_format` conversion. It can override or extend the base dataset's `column_mapping`. Keys are new names; values are column names from the loaded base dataset.
  - Default: `{}`
  - Example (`gsm8k_math_prompts.yaml`): see the illustrative configuration after this list. Note: the mapped columns (`query`, `ground_truth`) are then used by `convert_to_evaluation_format` to create `user_query` and `ground_truth_for_eval`.
- `derived_max_samples` (Optional)
  - Description: Maximum number of samples for this derived dataset. If specified, it overrides any `max_samples` from the base dataset configuration for the purpose of this derived dataset.
  - Default: `null`
  - Example: `derived_max_samples: 5`
- `description` (Optional, Metadata)
  - Description: A brief description of this derived dataset configuration.
  - Example: `description: "GSM8K dataset with math-specific system prompt in evaluation format."`
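A derived configuration in the spirit of `gsm8k_math_prompts.yaml` might combine these fields as sketched below. The `derived_column_mapping` entries are illustrative; only the mapped column names (`query`, `ground_truth`) and the system prompt text come from the field descriptions above.

```yaml
# Illustrative derived dataset config; mapping values are assumptions.
_target_: reward_kit.datasets.loader.load_derived_dataset
base_dataset: "gsm8k"          # references conf/dataset/gsm8k.yaml
system_prompt: "Solve the following math problem. Show your work clearly. Put the final numerical answer between <answer> and </answer> tags."
output_format: "evaluation_format"
derived_column_mapping:
  query: "query"               # columns from the loaded base dataset (assumed identity mapping)
  ground_truth: "ground_truth"
derived_max_samples: 5
description: "GSM8K dataset with math-specific system prompt in evaluation format."
```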
## How Configurations are Loaded
The `reward_kit.datasets.loader.py` script uses Hydra to:

- Compose these YAML configurations.
- Instantiate the appropriate loader function (`load_and_process_dataset` or `load_derived_dataset`) with the parameters defined in the YAML.
- Call the instantiated loader function, which uses these parameters to fetch data (e.g., from Hugging Face or local files), apply mappings, execute preprocessing steps, and format the data as requested.
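In practice, a run-level Hydra config can select one of these dataset configurations by name from the `dataset` config group (the group name mirrors the `conf/dataset/` directory). The file name and layout below are assumptions for illustration, not a documented Reward Kit config:

```yaml
# Hypothetical run config (e.g., conf/run_eval.yaml) selecting a dataset config by name.
defaults:
  - dataset: gsm8k_math_prompts   # resolves to conf/dataset/gsm8k_math_prompts.yaml
  - _self_
```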
This structured configuration approach allows for flexible and reproducible dataset management within the Reward Kit.