Process

Entity-aware processing, derivations, and data cleaning.

participant_fields()

python
phenofhy.process.participant_fields(input_data, *, derive="auto", derive_registry=None,
	coalesce_rules=None, auto_row_filters=True, age_col="derived.age_at_registration",
	min_age=18, max_age=110, floor_age=True, age_group_bins=None,
	age_group_labels=None, extra_ranges=None, extra_exprs=None,
	keep_na_in_ranges=False)

Process participant entity fields with optional derives and filters.

Parameters

  input_data: str | pandas.DataFrame
    File path or DataFrame input.
  derive: bool | list[str] | Literal["all", "auto"]
    Derivation selection policy.
  derive_registry: dict | None
    Optional custom derive registry.
  coalesce_rules: dict | None
    Optional coalesce rule overrides.
  auto_row_filters: bool
    Whether to apply age-based filtering.
  age_col: str
    Age column for filtering.
  min_age: int
    Minimum age to keep (inclusive).
  max_age: int
    Maximum age to keep (inclusive).
  floor_age: bool
    Whether to floor age values before deriving.
  age_group_bins: list[float] | None
    Optional custom age bins.
  age_group_labels: list[str] | None
    Optional custom age labels.
  extra_ranges: dict | None
    Optional extra numeric ranges for filtering.
  extra_exprs: list[str] | None
    Optional query expressions for filtering.
  keep_na_in_ranges: bool
    Whether to keep NA rows during range filters.

Returns

  out: pandas.DataFrame
    Processed DataFrame.
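
The age parameters above can be pictured with a small pandas sketch. This is not phenofhy's implementation: `filter_by_age` is a hypothetical helper, and it assumes that flooring happens before the inclusive range check and that missing ages are dropped unless explicitly kept.

```python
import numpy as np
import pandas as pd


def filter_by_age(df, age_col="derived.age_at_registration",
                  min_age=18, max_age=110, floor_age=True, keep_na=False):
    """Hypothetical stand-in for the age row filter (assumed behaviour)."""
    age = pd.to_numeric(df[age_col], errors="coerce")
    if floor_age:
        # Assumption: ages are floored before the range check.
        age = np.floor(age)
    # Series.between is inclusive on both ends by default.
    in_range = age.between(min_age, max_age)
    if keep_na:
        # Optionally retain rows whose age is missing.
        in_range = in_range | age.isna()
    return df.loc[in_range].copy()


demo = pd.DataFrame(
    {"derived.age_at_registration": [17.9, 18.0, 110.7, 111.0, None]}
)
kept = filter_by_age(demo)
kept_with_na = filter_by_age(demo, keep_na=True)
```

In this sketch, 110.7 survives only because it is floored to 110 before the comparison; with `floor_age=False` it would fall outside the range.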

questionnaire_fields()

python
phenofhy.process.questionnaire_fields(input_data, *, derive="auto",
	derive_registry=None, coalesce_rules=None, auto_row_filters=False,
	age_col="derived.age_at_registration", min_age=18, max_age=110,
	floor_age=True, extra_ranges=None, extra_exprs=None, keep_na_in_ranges=False)

Process questionnaire entity fields with optional derives.

Parameters

  input_data: str | pandas.DataFrame
    File path or DataFrame input.
  derive: bool | list[str] | Literal["all", "auto"]
    Derivation selection policy.
  derive_registry: dict | None
    Optional custom derive registry.
  coalesce_rules: dict | None
    Optional coalesce rule overrides.
  auto_row_filters: bool
    Whether to apply age-based filtering.
  age_col: str
    Age column for filtering.
  min_age: int
    Minimum age to keep (inclusive).
  max_age: int
    Maximum age to keep (inclusive).
  floor_age: bool
    Whether to floor age values before deriving.
  extra_ranges: dict | None
    Optional extra numeric ranges for filtering.
  extra_exprs: list[str] | None
    Optional query expressions for filtering.
  keep_na_in_ranges: bool
    Whether to keep NA rows during range filters.

Returns

  out: pandas.DataFrame
    Processed DataFrame.
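
One plausible reading of `extra_ranges`, `extra_exprs`, and `keep_na_in_ranges` is sketched below in plain pandas. The shapes are assumptions rather than documented behaviour: `extra_ranges` is taken to map a column name to a `(low, high)` pair, and each entry of `extra_exprs` is taken to be a `DataFrame.query` string. The helper `apply_extra_filters` and the demo column names are hypothetical.

```python
import pandas as pd


def apply_extra_filters(df, extra_ranges=None, extra_exprs=None,
                        keep_na_in_ranges=False):
    """Hypothetical illustration of range and expression filters."""
    out = df
    # Assumed shape: {"column": (low, high)}, both bounds inclusive.
    for col, (low, high) in (extra_ranges or {}).items():
        values = pd.to_numeric(out[col], errors="coerce")
        mask = values.between(low, high)
        if keep_na_in_ranges:
            # Keep rows where the value is missing instead of dropping them.
            mask = mask | values.isna()
        out = out.loc[mask]
    # Assumed shape: a list of pandas query strings, applied in order.
    for expr in extra_exprs or []:
        out = out.query(expr)
    return out.copy()


demo = pd.DataFrame(
    {"sleep_hours": [7.0, 30.0, None, 6.0], "smoker": [0, 1, 0, 1]}
)
filtered = apply_extra_filters(
    demo,
    extra_ranges={"sleep_hours": (0, 24)},
    extra_exprs=["smoker == 0"],
)
```

Column names containing dots, such as `derived.age_at_registration`, would need backtick quoting inside a pandas query string.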

clinic_measurements_fields()

python
phenofhy.process.clinic_measurements_fields(input_data, *, derive="auto",
	derive_registry=None, coalesce_rules=None, auto_row_filters=False,
	age_col="derived.age_at_registration", min_age=18, max_age=110,
	floor_age=True, extra_ranges=None, extra_exprs=None, keep_na_in_ranges=False)

Process clinic measurements fields with optional BMI derives.

Parameters

  input_data: str | pandas.DataFrame
    File path or DataFrame input.
  derive: bool | list[str] | Literal["all", "auto"]
    Derivation selection policy.
  derive_registry: dict | None
    Optional custom derive registry.
  coalesce_rules: dict | None
    Optional coalesce rule overrides.
  auto_row_filters: bool
    Whether to apply age-based filtering.
  age_col: str
    Age column for filtering.
  min_age: int
    Minimum age to keep (inclusive).
  max_age: int
    Maximum age to keep (inclusive).
  floor_age: bool
    Whether to floor age values before deriving.
  extra_ranges: dict | None
    Optional extra numeric ranges for filtering.
  extra_exprs: list[str] | None
    Optional query expressions for filtering.
  keep_na_in_ranges: bool
    Whether to keep NA rows during range filters.

Returns

  out: pandas.DataFrame
    Processed DataFrame.
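
For orientation, a BMI derive follows the standard formula of weight in kilograms divided by height in metres squared. The sketch below is a minimal pandas version under stated assumptions: `add_bmi`, the input column names, and the output column `derived.bmi` are all hypothetical and may not match what phenofhy actually reads or writes.

```python
import pandas as pd


def add_bmi(df, weight_col="weight_kg", height_col="height_cm",
            out_col="derived.bmi"):
    """Hypothetical BMI derive: kg / m**2, with heights given in cm."""
    out = df.copy()
    weight = pd.to_numeric(out[weight_col], errors="coerce")
    height_m = pd.to_numeric(out[height_col], errors="coerce") / 100.0
    # Mask non-positive heights so they yield NaN rather than inf.
    out[out_col] = weight / height_m.where(height_m > 0) ** 2
    return out


demo = pd.DataFrame({"weight_kg": [70.0, 80.0], "height_cm": [175.0, 0.0]})
with_bmi = add_bmi(demo)
```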

get_dummies()

python
phenofhy.process.get_dummies(df, codings_glob="./metadata/*.codings.csv",
	coding_name="MEDICAT_1_M", col="questionnaire.medicat_1_m",
	prefix="derived.medicates_", exclude_codes=(-7, -1, -3), user_map=None,
	inplace=True)

Expand a coded multi-select column into dummy variables.

Parameters

  df: pandas.DataFrame
    Input dataframe to modify.
  codings_glob: str
    Glob path to codings CSV files.
  coding_name: str
    Coding name to match in codings CSVs.
  col: str
    Column containing coded values.
  prefix: str
    Prefix for generated dummy columns.
  exclude_codes: tuple[int, ...] | None
    Codes to exclude from expansion.
  user_map: dict | None
    Optional code-to-label mapping overrides.
  inplace: bool
    If True, modify df in place.

Returns

  out: pandas.DataFrame | tuple[pandas.DataFrame, pandas.DataFrame]
    Updated dataframe, or a tuple of the updated dataframe and a mapping dataframe when the helper returns one.
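
The general idea of expanding a multi-select column can be sketched with pandas alone. Everything specific here is an assumption: the source does not say how multi-select values are stored, so the sketch assumes pipe-delimited code strings such as `"1|2"`; `expand_multiselect` and the labels `med_a` and `med_b` are made up for illustration and do not come from any codings file.

```python
import pandas as pd


def expand_multiselect(df, col, code_labels, prefix="derived.medicates_",
                       exclude_codes=(-7, -1, -3), sep="|"):
    """Hypothetical dummy expansion for a delimited multi-select column."""
    # One 0/1 column per distinct code; missing cells become all-zero rows.
    dummies = df[col].str.get_dummies(sep=sep)
    excluded = {str(code) for code in (exclude_codes or ())}
    keep = [code for code in dummies.columns if code not in excluded]
    # Name each dummy by its label, falling back to the raw code.
    renamed = {code: prefix + code_labels.get(int(code), code) for code in keep}
    return df.join(dummies[keep].rename(columns=renamed))


demo = pd.DataFrame({"questionnaire.medicat_1_m": ["1|2", "2", "-7", None]})
labels = {1: "med_a", 2: "med_b"}  # illustrative labels only
expanded = expand_multiselect(demo, "questionnaire.medicat_1_m", labels)
```

Unlike the documented function with `inplace=True`, this sketch returns a new dataframe and leaves the input untouched.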

resolve_categoricals_and_labels()

python
phenofhy.process.resolve_categoricals_and_labels(df, traits, *, label_mode="labels",
	codebook_csv=None, autodetect_coded_categoricals=True, autodetect_max_levels=10)

Prepare a DataFrame and categorical trait list for summary.

Parameters

  df: pandas.DataFrame
    Input dataframe.
  traits: Iterable[str]
    Trait columns to evaluate.
  label_mode: Literal["labels", "codes"]
    Direction of mapping: "labels" maps codes to labels; "codes" maps labels back to codes.
  codebook_csv: str | None
    Optional codings CSV path.
  autodetect_coded_categoricals: bool
    Whether to infer coded categoricals from numeric columns.
  autodetect_max_levels: int
    Maximum number of unique values for a numeric column to be treated as categorical.

Returns

  out: tuple[pandas.DataFrame, list[str]]
    Processed dataframe and categorical trait list.
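
The autodetection parameters suggest a unique-value heuristic, which can be sketched as follows. This is an assumed reading, not the library's logic: `detect_categoricals` is a hypothetical helper that treats every non-numeric column as categorical and every numeric column with at most `max_levels` distinct values as a coded categorical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def detect_categoricals(df, traits, max_levels=10):
    """Hypothetical heuristic for picking categorical traits."""
    detected = []
    for trait in traits:
        series = df[trait]
        if not is_numeric_dtype(series):
            # Strings and similar dtypes are taken as categorical outright.
            detected.append(trait)
        elif series.nunique(dropna=True) <= max_levels:
            # Few distinct numeric values: likely integer-coded categories.
            detected.append(trait)
    return detected


demo = pd.DataFrame(
    {
        "derived.sex": [1, 2, 2, 1],
        "derived.bmi": [22.5, 31.0, 27.4, 19.8],
        "site": ["a", "b", "a", "c"],
    }
)
cat_cols = detect_categoricals(demo, list(demo.columns), max_levels=3)
```

A heuristic like this can misclassify genuinely numeric columns that take few values, such as a count ranging from 0 to 3, which is presumably why the threshold is configurable.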

Example

python
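from phenofhy import process

# Process participant fields with the default derives and age filters.
df = process.participant_fields("outputs/raw/phenos.csv")
# Resolve labels and get the list of traits treated as categorical.
df2, cat_traits = process.resolve_categoricals_and_labels(df, traits=["derived.sex"])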