Simulate
Synthetic phenotype dataframe generation utilities for local testing and examples.
The code blocks on this page show copy-paste-safe call patterns. In the actual Python signature, some parameters are keyword-only.
simulate_phenotype_df()
phenofhy.simulate.simulate_phenotype_df(sample=1000, fields=None,
include_nonresponse=False, missing_rate=None, seed=None)
Simulate an OFH-like phenotype dataframe from metadata dictionaries and coding domains.
In the implementation, include_nonresponse, missing_rate, and seed are keyword-only parameters.
If fields is None, the function uses DEFAULT_FIELDS. You can also request other valid entity.field names present in beta/helpers/data_dictionary.csv. For non-default fields, realism depends on the available metadata: coded fields use values from codings.csv, numeric fields without a coding domain use conservative fallback ranges, and unrecognized dtypes return all-missing values.
When fields is None, the default columns are:
participant.pid
participant.birth_year
participant.birth_month
participant.registration_year
participant.registration_month
participant.demog_sex_1_1
participant.demog_sex_2_1
participant.demog_ethnicity_1_1
questionnaire.demog_height_1_1
questionnaire.smoke_status_2_1
clinic_measurements.waist
clinic_measurements.height
clinic_measurements.weight
questionnaire.alcohol_curr_1_1
clinic_measurements.heart_first_rate
Parameters
sample: int
Number of rows to generate. Must be >= 1.
fields: str | Sequence[str] | None
Single field name, sequence of field names, or None to use DEFAULT_FIELDS.
include_nonresponse: bool
If True, keep known non-response/sentinel codes in coded domains.
missing_rate: float | Mapping[str, float] | None
Global missingness probability or per-field mapping. If None, uses module defaults.
seed: int | None
Optional random seed for reproducible output.
Returns
out: pandas.DataFrame
Simulated dataframe with requested columns.
Raises
ValueError: Exception
If sample is < 1, requested fields are unknown, or missing_rate values are outside [0, 1].
Example
from phenofhy import simulate
df_default = simulate.simulate_phenotype_df()
df = simulate.simulate_phenotype_df(
    sample=500,
    fields=["participant.pid", "participant.birth_year", "clinic_measurements.weight"],
    include_nonresponse=False,
    seed=42,
)
_normalize_fields()
phenofhy.simulate._normalize_fields(fields)
Normalize field selection input to a non-empty list of field names.
Parameters
fields: str | Sequence[str] | None
Single field name, sequence of names, or None.
Returns
out: list[str]
Normalized field list. If fields is None, returns DEFAULT_FIELDS.
Raises
ValueError: Exception
If fields is an empty sequence.
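The normalization rules above can be illustrated with a minimal standalone sketch. It is not the library's code: normalize_fields and the shortened DEFAULT_FIELDS list below are hypothetical stand-ins that only mirror the documented behavior (None gives the defaults, a bare string becomes a one-item list, an empty sequence raises ValueError).

```python
from __future__ import annotations

from collections.abc import Sequence

# Hypothetical, shortened stand-in for the module's DEFAULT_FIELDS.
DEFAULT_FIELDS = ["participant.pid", "participant.birth_year"]


def normalize_fields(fields: str | Sequence[str] | None) -> list[str]:
    """Return a non-empty list of field names (illustrative sketch only)."""
    if fields is None:
        # Copy so callers cannot mutate the shared default list.
        return list(DEFAULT_FIELDS)
    if isinstance(fields, str):
        # A bare string is one field name, not a sequence of characters.
        return [fields]
    out = list(fields)
    if not out:
        raise ValueError("fields must not be an empty sequence")
    return out


single = normalize_fields("participant.pid")
several = normalize_fields(("participant.pid", "clinic_measurements.weight"))
defaults = normalize_fields(None)
```

Checking for str before treating the input as a sequence matters because a Python string is itself a sequence of characters.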
_helpers_dir()
phenofhy.simulate._helpers_dir()
Resolve the helpers directory path used by metadata loaders.
Returns
out: pathlib.Path
Absolute path to beta/helpers.
_data_dictionary()
phenofhy.simulate._data_dictionary()
Load and cache metadata from data_dictionary.csv with a computed full_field column.
Returns
out: pandas.DataFrame
Data dictionary with entity, field, type, and full_field columns.
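As a rough illustration of the computed full_field column, the sketch below parses a small in-memory CSV with the standard library instead of pandas. The two sample rows, the load_dictionary name, and the rule that full_field is entity and field joined by a dot are assumptions inferred from the entity.field naming used on this page. Caching is omitted.

```python
from __future__ import annotations

import csv
import io

# Hypothetical two-row excerpt; the real data_dictionary.csv may hold more columns.
SAMPLE_CSV = """entity,field,type
participant,birth_year,integer
clinic_measurements,weight,float
"""


def load_dictionary(text: str) -> list[dict[str, str]]:
    """Parse CSV text and add a full_field key to every row (sketch)."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        # Assumed rule: full_field is "<entity>.<field>".
        row["full_field"] = f"{row['entity']}.{row['field']}"
    return rows


rows = load_dictionary(SAMPLE_CSV)
full_fields = [row["full_field"] for row in rows]
```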
_coded_domains()
phenofhy.simulate._coded_domains(include_nonresponse)
Load and cache coded value domains from codings.csv.
Parameters
include_nonresponse: bool
Whether to keep known non-response/sentinel codes.
Returns
out: dict[str, list[float]]
Mapping of full field names to available numeric code domains.
Raises
ValueError: Exception
If codings.csv does not contain either coding_name or field.
_filter_nonresponse_codes()
phenofhy.simulate._filter_nonresponse_codes(field, codes, include_nonresponse)
Filter out known non-response codes for a coded field when requested.
Parameters
field: str
Full field name (entity.field) used for field-specific exclusions.
codes: Sequence[float]
Candidate numeric codes.
include_nonresponse: bool
If True, returns codes unchanged.
Returns
out: list[float]
Filtered code list with known sentinel values removed when include_nonresponse is False.
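The filtering contract can be sketched as follows. The sentinel values are invented for illustration, since this page does not list the codes the module treats as non-response. GLOBAL_SENTINELS, FIELD_SENTINELS, and filter_nonresponse_codes are hypothetical names.

```python
from __future__ import annotations

from collections.abc import Sequence

# Hypothetical sentinel codes; the module's real exclusion lists are not shown on this page.
GLOBAL_SENTINELS = {-1.0, -3.0}
FIELD_SENTINELS = {"questionnaire.smoke_status_2_1": {9.0}}


def filter_nonresponse_codes(
    field: str, codes: Sequence[float], include_nonresponse: bool
) -> list[float]:
    """Drop sentinel codes unless include_nonresponse is True (sketch)."""
    if include_nonresponse:
        return list(codes)
    # Combine the global exclusions with any field-specific ones.
    excluded = GLOBAL_SENTINELS | FIELD_SENTINELS.get(field, set())
    return [code for code in codes if code not in excluded]


codes = [1.0, 2.0, 9.0, -1.0]
kept = filter_nonresponse_codes("questionnaire.smoke_status_2_1", codes, False)
unchanged = filter_nonresponse_codes("questionnaire.smoke_status_2_1", codes, True)
other_field = filter_nonresponse_codes("questionnaire.alcohol_curr_1_1", codes, False)
```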
_simulate_column()
phenofhy.simulate._simulate_column(field, dtype, sample, rng, coded_domains)
Generate a synthetic column for one field using coded domains or dtype-specific fallbacks.
In the implementation, all parameters are keyword-only.
Parameters
field: str
Full field name (entity.field).
dtype: str
Field dtype string (integer, float, string, date, datetime, or other).
sample: int
Number of rows to generate.
rng: numpy.random.Generator
Random number generator instance.
coded_domains: Mapping[str, list[float]]
Per-field coded domains loaded from metadata.
Returns
out: pandas.Series
Simulated column values. Unknown dtypes return an all-NaN series.
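The dispatch order described for this helper (coded domain first, then a dtype fallback, then all-missing) might look roughly like the sketch below. It is a pure-Python analogue under several assumptions: random.Random stands in for numpy.random.Generator, a list stands in for pandas.Series, the FALLBACK_RANGES values are invented, and the string, date, and datetime branches are left out.

```python
from __future__ import annotations

import math
import random
from collections.abc import Mapping

# Hypothetical fallback ranges; the module's actual ranges are not listed on this page.
FALLBACK_RANGES = {"integer": (0, 100), "float": (0.0, 100.0)}


def simulate_column(
    *,
    field: str,
    dtype: str,
    sample: int,
    rng: random.Random,
    coded_domains: Mapping[str, list[float]],
) -> list[float]:
    """Pick values from a coded domain, else a dtype fallback, else NaN (sketch)."""
    codes = coded_domains.get(field)
    if codes:
        # Coded fields sample only from their known code values.
        return [rng.choice(codes) for _ in range(sample)]
    if dtype == "integer":
        low, high = FALLBACK_RANGES["integer"]
        return [rng.randint(low, high) for _ in range(sample)]
    if dtype == "float":
        low, high = FALLBACK_RANGES["float"]
        return [rng.uniform(low, high) for _ in range(sample)]
    if dtype in {"string", "date", "datetime"}:
        raise NotImplementedError("branch omitted from this sketch")
    # Unrecognized dtypes yield an all-missing column.
    return [math.nan] * sample


rng = random.Random(0)
domains = {"participant.demog_sex_1_1": [1.0, 2.0]}
coded = simulate_column(field="participant.demog_sex_1_1", dtype="integer",
                        sample=20, rng=rng, coded_domains=domains)
fallback = simulate_column(field="clinic_measurements.weight", dtype="float",
                           sample=5, rng=rng, coded_domains=domains)
unknown = simulate_column(field="example.blob", dtype="blob",
                          sample=3, rng=rng, coded_domains=domains)
```

The bare * in the signature makes every parameter keyword-only, matching the note above about the implementation.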
_apply_missingness()
phenofhy.simulate._apply_missingness(col, field, missing_rate, rng)
Apply field-level missingness to a simulated column.
In the implementation, field, missing_rate, and rng are keyword-only parameters.
Parameters
col: pandas.Series
Input series to mask with missing values.
field: str
Full field name used to resolve missingness probability.
missing_rate: float | Mapping[str, float] | None
Global or per-field missingness specification.
rng: numpy.random.Generator
Random number generator instance.
Returns
out: pandas.Series
Series with missing entries injected (NaT for datetime-like, NaN otherwise).
Raises
ValueError: Exception
If resolved missing probability is greater than 1.
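A simplified view of the masking step: each value is independently replaced by a missing marker with probability p. This sketch takes an already resolved probability rather than field and missing_rate, uses a list and random.Random in place of pandas.Series and numpy.random.Generator, and handles only the NaN case (not NaT). The name apply_missingness and its reduced signature are illustrative assumptions.

```python
from __future__ import annotations

import math
import random


def apply_missingness(values: list[float], *, p: float, rng: random.Random) -> list[float]:
    """Replace each value with NaN with independent probability p (sketch)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"missing probability must be in [0, 1], got {p}")
    # rng.random() is in [0, 1), so p=0 masks nothing and p=1 masks everything.
    return [math.nan if rng.random() < p else v for v in values]


column = [1.0, 2.0, 3.0, 4.0]
untouched = apply_missingness(column, p=0.0, rng=random.Random(1))
all_missing = apply_missingness(column, p=1.0, rng=random.Random(1))
```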
_resolve_missing_rate()
phenofhy.simulate._resolve_missing_rate(field, missing_rate)
Resolve missingness probability for one field from user input and defaults.
Parameters
field: str
Full field name (entity.field).
missing_rate: float | Mapping[str, float] | None
Global value, per-field mapping, or None.
Returns
out: float
Resolved missingness probability in [0, 1].
Raises
ValueError: Exception
If resolved value is outside [0, 1].
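One plausible reading of the resolution order is sketched below. Two points are assumptions rather than documented behavior: the default rate of 0.05, and the choice to fall back to that default when a per-field mapping does not mention the field. resolve_missing_rate and DEFAULT_MISSING_RATE are hypothetical names.

```python
from __future__ import annotations

from collections.abc import Mapping

# Hypothetical module default; the real default rates are not stated on this page.
DEFAULT_MISSING_RATE = 0.05


def resolve_missing_rate(
    field: str, missing_rate: float | Mapping[str, float] | None
) -> float:
    """Resolve one field's missingness probability and validate it (sketch)."""
    if missing_rate is None:
        p = DEFAULT_MISSING_RATE
    elif isinstance(missing_rate, Mapping):
        # Assumption: fields absent from the mapping use the default rate.
        p = missing_rate.get(field, DEFAULT_MISSING_RATE)
    else:
        p = missing_rate
    p = float(p)
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"missing rate for {field!r} must be in [0, 1], got {p}")
    return p


global_rate = resolve_missing_rate("participant.birth_year", 0.2)
mapped_rate = resolve_missing_rate(
    "clinic_measurements.weight", {"clinic_measurements.weight": 0.5}
)
default_rate = resolve_missing_rate("participant.pid", None)
```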
_random_token()
phenofhy.simulate._random_token(rng, size)
Generate an uppercase alphanumeric random token.
Parameters
rng: numpy.random.Generator
Random number generator instance.
size: int
Token length.
Returns
out: str
Random token using A-Z and 0-9.
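A token of this shape can be produced by drawing size characters from a 36-character alphabet. The sketch below uses the standard library's random.Random as a stand-in for numpy.random.Generator. random_token and ALPHABET are illustrative names, not the module's.

```python
import random
import string

# A-Z followed by 0-9, matching the documented token alphabet.
ALPHABET = string.ascii_uppercase + string.digits


def random_token(rng: random.Random, size: int) -> str:
    """Draw size characters uniformly from ALPHABET (sketch)."""
    return "".join(rng.choice(ALPHABET) for _ in range(size))


token = random_token(random.Random(42), 8)
repeat = random_token(random.Random(42), 8)
```

Seeding two generators identically yields the same token, which is the property that makes seeded simulation output reproducible.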