Simulate

Synthetic phenotype dataframe generation utilities for local testing and examples.

The signature blocks on this page are written as call patterns that can be copied as-is. In the actual Python signatures, some parameters are keyword-only; these are noted under each function.

simulate_phenotype_df()

python

phenofhy.simulate.simulate_phenotype_df(sample=1000, fields=None,
    include_nonresponse=False, missing_rate=None, seed=None)

Simulate an OFH-like phenotype dataframe from metadata dictionaries and coding domains.

In the implementation, include_nonresponse, missing_rate, and seed are keyword-only parameters.

If fields is None, the function uses DEFAULT_FIELDS. You can also request any other valid entity.field name present in beta/helpers/data_dictionary.csv. For non-default fields, realism depends on the available metadata: coded fields draw their values from codings.csv, numeric fields without a coding domain use conservative fallback ranges, and fields with an unrecognized dtype are returned as all-missing columns.

When fields is None, the default columns are:

participant.pid
participant.birth_year
participant.birth_month
participant.registration_year
participant.registration_month
participant.demog_sex_1_1
participant.demog_sex_2_1
participant.demog_ethnicity_1_1
questionnaire.demog_height_1_1
questionnaire.smoke_status_2_1
clinic_measurements.waist
clinic_measurements.height
clinic_measurements.weight
questionnaire.alcohol_curr_1_1
clinic_measurements.heart_first_rate

Parameters

  sample: int
    Number of rows to generate. Must be >= 1.
  fields: str | Sequence[str] | None
    Single field name, sequence of field names, or None to use DEFAULT_FIELDS.
  include_nonresponse: bool
    If True, keep known non-response/sentinel codes in coded domains.
  missing_rate: float | Mapping[str, float] | None
    Global missingness probability or per-field mapping. If None, uses module defaults.
  seed: int | None
    Optional random seed for reproducible output.

Returns

out: pandas.DataFrame
Simulated dataframe with requested columns.

Raises

ValueError: Exception
If sample is less than 1, any requested field is unknown, or any missing_rate value is outside [0, 1].

Example

python

from phenofhy import simulate

# Default call: 1000 rows using DEFAULT_FIELDS.
df_default = simulate.simulate_phenotype_df()

# Explicit field selection with a fixed seed for reproducible output.
df = simulate.simulate_phenotype_df(
    sample=500,
    fields=["participant.pid", "participant.birth_year", "clinic_measurements.weight"],
    include_nonresponse=False,
    seed=42,
)

_normalize_fields()

python

phenofhy.simulate._normalize_fields(fields)

Normalize field selection input to a non-empty list of field names.

Parameters

fields: str | Sequence[str] | None
Single field name, sequence of names, or None.

Returns

out: list[str]
Normalized field list. If fields is None, returns DEFAULT_FIELDS.

Raises

ValueError: Exception
If fields is an empty sequence.
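
The normalization rules above can be illustrated with a minimal sketch. This is not the library's code: normalize_fields_sketch is a hypothetical name, and the DEFAULT_FIELDS list here is a shortened stand-in for the real default list.

```python
from __future__ import annotations

from collections.abc import Sequence

# Hypothetical, shortened stand-in for the module's DEFAULT_FIELDS.
DEFAULT_FIELDS = ["participant.pid", "participant.birth_year"]


def normalize_fields_sketch(fields: str | Sequence[str] | None) -> list[str]:
    """Return a non-empty list of field names (illustrative only)."""
    if fields is None:
        # No selection: fall back to a copy of the defaults.
        return list(DEFAULT_FIELDS)
    if isinstance(fields, str):
        # A bare string is one field name, not a sequence of characters.
        return [fields]
    out = list(fields)
    if not out:
        raise ValueError("fields must not be an empty sequence")
    return out
```

Checking for str before treating the input as a sequence matters, because a Python string is itself a sequence and would otherwise be split into single characters.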

_helpers_dir()

python

phenofhy.simulate._helpers_dir()

Resolve the helpers directory path used by metadata loaders.

Returns

out: pathlib.Path
Absolute path to beta/helpers.

_data_dictionary()

python

phenofhy.simulate._data_dictionary()

Load and cache metadata from data_dictionary.csv with a computed full_field column.

Returns

out: pandas.DataFrame
Data dictionary with entity, field, type, and full_field columns.
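
As a rough illustration of "load, cache, and add full_field", the sketch below reads a small inline CSV instead of the real data_dictionary.csv. The function name, the inline rows, and the use of functools.lru_cache are assumptions made for the example; the actual caching mechanism is not described on this page.

```python
from functools import lru_cache
from io import StringIO

import pandas as pd

# Hypothetical stand-in for data_dictionary.csv; the real file has more rows
# and may have more columns.
_CSV_TEXT = """entity,field,type
participant,pid,string
participant,birth_year,integer
clinic_measurements,weight,float
"""


@lru_cache(maxsize=1)
def data_dictionary_sketch() -> pd.DataFrame:
    """Load the dictionary once and add an entity.field key column."""
    df = pd.read_csv(StringIO(_CSV_TEXT))
    # full_field is the "entity.field" name used to select columns.
    df["full_field"] = df["entity"] + "." + df["field"]
    return df
```

With this kind of cache every caller receives the same DataFrame object, so callers should treat it as read-only or work on a copy.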

_coded_domains()

python

phenofhy.simulate._coded_domains(include_nonresponse)

Load and cache coded value domains from codings.csv.

Parameters

include_nonresponse: bool
Whether to keep known non-response/sentinel codes.

Returns

out: dict[str, list[float]]
Mapping of full field names to available numeric code domains.

Raises

ValueError: Exception
If codings.csv contains neither a coding_name column nor a field column.

_filter_nonresponse_codes()

python

phenofhy.simulate._filter_nonresponse_codes(field, codes, include_nonresponse)

Filter out known non-response codes for a coded field when requested.

Parameters

  field: str
    Full field name (entity.field) used for field-specific exclusions.
  codes: Sequence[float]
    Candidate numeric codes.
  include_nonresponse: bool
    If True, returns codes unchanged.

Returns

out: list[float]
Filtered code list with known sentinel values removed when include_nonresponse is False.
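
A minimal sketch of this kind of filtering follows. The sentinel values and the per-field exclusion table are invented for illustration; the real non-response codes are defined in the phenofhy source and are not listed on this page.

```python
from __future__ import annotations

from collections.abc import Sequence

# Hypothetical sentinel codes, chosen only for this example.
GLOBAL_NONRESPONSE = {-1.0, -3.0}
FIELD_NONRESPONSE = {"questionnaire.smoke_status_2_1": {9.0}}


def filter_nonresponse_sketch(
    field: str, codes: Sequence[float], include_nonresponse: bool
) -> list[float]:
    """Drop sentinel codes unless the caller asked to keep them."""
    if include_nonresponse:
        return list(codes)
    # Combine global sentinels with any exclusions specific to this field.
    excluded = GLOBAL_NONRESPONSE | FIELD_NONRESPONSE.get(field, set())
    return [c for c in codes if c not in excluded]
```

In this sketch, the code 9.0 is removed only for the field that lists it as a sentinel, while -1.0 is removed everywhere.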

_simulate_column()

python

phenofhy.simulate._simulate_column(field, dtype, sample, rng, coded_domains)

Generate a synthetic column for one field using coded domains or dtype-specific fallbacks.

In the implementation, all parameters are keyword-only.

Parameters

  field: str
    Full field name (entity.field).
  dtype: str
    Field dtype string (integer, float, string, date, datetime, or other).
  sample: int
    Number of rows to generate.
  rng: numpy.random.Generator
    Random number generator instance.
  coded_domains: Mapping[str, list[float]]
    Per-field coded domains loaded from metadata.

Returns

out: pandas.Series
Simulated column values. Unknown dtypes return an all-NaN series.
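
The dispatch described above (coded domain first, then a dtype-specific fallback, then all-NaN) might look roughly like the sketch below. Every numeric range, the string format, and the base date are placeholders chosen for the example, not the library's actual fallbacks.

```python
from __future__ import annotations

from collections.abc import Mapping

import numpy as np
import pandas as pd


def simulate_column_sketch(
    *,
    field: str,
    dtype: str,
    sample: int,
    rng: np.random.Generator,
    coded_domains: Mapping[str, list[float]],
) -> pd.Series:
    """Generate one synthetic column (illustrative only)."""
    codes = coded_domains.get(field)
    if codes:
        # Coded fields: sample uniformly from the available codes.
        values = rng.choice(np.asarray(codes, dtype=float), size=sample)
    elif dtype == "integer":
        values = rng.integers(0, 100, size=sample)  # placeholder range
    elif dtype == "float":
        values = rng.uniform(0.0, 100.0, size=sample)  # placeholder range
    elif dtype == "string":
        values = [f"S{n:06d}" for n in rng.integers(0, 10**6, size=sample)]
    elif dtype in ("date", "datetime"):
        # Random day offsets from an arbitrary base date.
        days = rng.integers(0, 3650, size=sample)
        values = pd.Timestamp("2015-01-01") + pd.to_timedelta(days, unit="D")
    else:
        # Unknown dtype: nothing sensible to simulate.
        values = np.full(sample, np.nan)
    return pd.Series(values, name=field)
```

Checking the coded domain before the dtype means a coded integer field yields only valid codes rather than arbitrary integers.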

_apply_missingness()

python

phenofhy.simulate._apply_missingness(col, field, missing_rate, rng)

Apply field-level missingness to a simulated column.

In the implementation, field, missing_rate, and rng are keyword-only parameters.

Parameters

  col: pandas.Series
    Input series to mask with missing values.
  field: str
    Full field name used to resolve missingness probability.
  missing_rate: float | Mapping[str, float] | None
    Global or per-field missingness specification.
  rng: numpy.random.Generator
    Random number generator instance.

Returns

out: pandas.Series
Series with missing entries injected (NaT for datetime-like, NaN otherwise).

Raises

ValueError: Exception
If resolved missing probability is greater than 1.
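
One plausible way to inject missingness is to mask each entry independently with a Bernoulli draw. The sketch below assumes that approach and takes an already-resolved probability p rather than the field and missing_rate arguments of the real function; apply_missingness_sketch is a hypothetical name.

```python
import numpy as np
import pandas as pd


def apply_missingness_sketch(
    col: pd.Series, *, p: float, rng: np.random.Generator
) -> pd.Series:
    """Mask each entry independently with probability p (illustrative only)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"missing probability must be in [0, 1], got {p}")
    # rng.random draws from [0, 1), so p=0 masks nothing and p=1 masks everything.
    mask = rng.random(len(col)) < p
    out = col.copy()
    if pd.api.types.is_datetime64_any_dtype(out):
        out[mask] = pd.NaT
    else:
        # Plain integer columns cannot hold NaN, so cast to float first.
        if pd.api.types.is_integer_dtype(out):
            out = out.astype("float64")
        out[mask] = np.nan
    return out
```

Note that the sketch casts integer columns to float so they can hold NaN; whether the library does the same or uses a nullable dtype is not stated here.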

_resolve_missing_rate()

python

phenofhy.simulate._resolve_missing_rate(field, missing_rate)

Resolve missingness probability for one field from user input and defaults.

Parameters

  field: str
    Full field name (entity.field).
  missing_rate: float | Mapping[str, float] | None
    Global value, per-field mapping, or None.

Returns

out: float
Resolved missingness probability in [0, 1].

Raises

ValueError: Exception
If resolved value is outside [0, 1].
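
The three accepted input shapes (None, a single float, a per-field mapping) can be resolved as in the following sketch. The default rates and the fallback used for fields missing from a mapping are assumptions; the module's real defaults are not documented on this page.

```python
from __future__ import annotations

from collections.abc import Mapping

# Hypothetical defaults, chosen only for this example.
DEFAULT_RATE = 0.05
DEFAULT_FIELD_RATES = {"participant.pid": 0.0}


def resolve_missing_rate_sketch(
    field: str, missing_rate: float | Mapping[str, float] | None
) -> float:
    """Resolve one field's missingness probability (illustrative only)."""
    default = DEFAULT_FIELD_RATES.get(field, DEFAULT_RATE)
    if missing_rate is None:
        p = default
    elif isinstance(missing_rate, Mapping):
        # Assumption: fields absent from the mapping fall back to the defaults.
        p = missing_rate.get(field, default)
    else:
        p = missing_rate
    p = float(p)
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"missing rate for {field!r} must be in [0, 1], got {p}")
    return p
```

Validating after resolution, rather than on input, means a bad value in a per-field mapping is reported only for the field that actually uses it.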

_random_token()

python

phenofhy.simulate._random_token(rng, size)

Generate an uppercase alphanumeric random token.

Parameters

  rng: numpy.random.Generator
    Random number generator instance.
  size: int
    Token length.

Returns

out: str
Random token using A-Z and 0-9.
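
A token of this shape can be drawn with a NumPy Generator as in the sketch below. The function name is hypothetical and the real helper may build the string differently; only the alphabet (A-Z and 0-9) comes from the description above.

```python
import string

import numpy as np

# Alphabet from the description above: uppercase letters and digits.
_ALPHABET = np.array(list(string.ascii_uppercase + string.digits))


def random_token_sketch(rng: np.random.Generator, size: int) -> str:
    """Draw `size` characters with replacement and join them into one string."""
    return "".join(rng.choice(_ALPHABET, size=size))
```

Because the characters come from the supplied generator, two generators created with the same seed produce the same token.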
