ICD

ICD trait matching for pandas and Spark workflows.

match_icd_traits()

python
phenofhy.icd.match_icd_traits(raw_df, traits_and_codes, *, diag_cols=None,
	pid_col="nhse_eng_inpat.pid", prefix_if_len_at_most=3, primary_only=False,
	return_occurrence_counts=True, use_pyarrow_strings=True, chunksize=None,
	diag_prefix="nhse_eng_inpat.diag_4_", extra_diag_cols=None, all_if_none=False)

Match ICD traits in a pandas DataFrame.

Parameters

  raw_df: pandas.DataFrame
    Input dataframe with diagnosis columns.
  traits_and_codes: dict | pandas.DataFrame | tuple[list, list]
    Trait/code mapping as dict, DataFrame with trait/ICD_code, or aligned sequences.
  diag_cols: list[str] | None
    Explicit diagnosis columns to scan.
  pid_col: str
    Participant ID column name.
  prefix_if_len_at_most: int | None
    Codes with at most this many characters are treated as prefixes.
  primary_only: bool
    If True, use only the primary diagnosis column.
  return_occurrence_counts: bool
    If True, include occurrence counts in the summary.
  use_pyarrow_strings: bool
    Prefer pyarrow string dtype for faster vectorized ops.
  chunksize: int | None
    Optional chunk size for large dataframes.
  diag_prefix: str | None
    Prefix to select diagnosis columns when diag_cols is None.
  extra_diag_cols: list[str] | None
    Additional diagnosis columns to include when present.
  all_if_none: bool
    If True, use all non-pid columns when no diagnosis columns are found.

Returns

  out: tuple[dict[str, set], pandas.DataFrame]
    Trait-to-pids mapping and a summary DataFrame.
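
The prefix rule controlled by prefix_if_len_at_most can be pictured with a small pure-Python sketch. This is not the library's implementation: code_matches is a hypothetical helper, and the sketch assumes that codes longer than the threshold must match exactly and that None disables prefix matching.

```python
def code_matches(diag_code, target, prefix_if_len_at_most=3):
    """Return True if diag_code matches target under the assumed prefix rule."""
    if prefix_if_len_at_most is not None and len(target) <= prefix_if_len_at_most:
        # Short targets such as "J45" act as prefixes: "J450", "J459", ...
        return diag_code.startswith(target)
    # Longer targets such as "J450" must match exactly (assumption).
    return diag_code == target


print(code_matches("J459", "J45"))   # True: "J45" is short, so it acts as a prefix
print(code_matches("J459", "J450"))  # False: 4-character target, exact match only
print(code_matches("J459", "J45", prefix_if_len_at_most=None))  # False: prefixing off
```

Under this reading, a three-character category such as J45 picks up all of its four-character subcodes, while a four-character code selects only itself.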

Example

python
from phenofhy import icd

# df: pandas DataFrame containing the pid column and the diagnosis columns.
trait_map = {"asthma": ["J45", "J46"]}
trait_pids, summary = icd.match_icd_traits(df, trait_map)
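
The example above passes traits_and_codes as a dict. To show how the dict form and the aligned-sequences form describe the same mapping, here is a minimal sketch that flattens both into (trait, code) pairs. The helper to_trait_code_pairs is hypothetical and not part of phenofhy, and the DataFrame form (columns trait and ICD_code) is left out to keep the sketch dependency-free.

```python
def to_trait_code_pairs(traits_and_codes):
    """Flatten a trait/code mapping into a list of (trait, code) pairs."""
    if isinstance(traits_and_codes, dict):
        pairs = []
        for trait, codes in traits_and_codes.items():
            # Accept a single code string as well as a list of codes.
            if isinstance(codes, str):
                codes = [codes]
            pairs.extend((trait, code) for code in codes)
        return pairs
    # Otherwise expect two aligned sequences: (traits, codes).
    traits, codes = traits_and_codes
    if len(traits) != len(codes):
        raise ValueError("traits and codes must have the same length")
    return list(zip(traits, codes))


as_dict = {"asthma": ["J45", "J46"]}
as_aligned = (["asthma", "asthma"], ["J45", "J46"])
print(to_trait_code_pairs(as_dict) == to_trait_code_pairs(as_aligned))  # True
```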

get_matched_icd_traits()

python
phenofhy.icd.get_matched_icd_traits(raw_df, traits_and_codes, *, diag_cols=None,
	prefix_if_len_at_most=3, uppercase=True, remove_chars=())

Summarize matched ICD codes per trait (pandas path).

Parameters

  raw_df: pandas.DataFrame
    Input dataframe with diagnosis columns.
  traits_and_codes: dict | pandas.DataFrame | tuple[list, list]
    Trait/code mapping as dict, DataFrame with trait/ICD_code, or aligned sequences.
  diag_cols: list[str] | None
    Explicit diagnosis columns to scan.
  prefix_if_len_at_most: int
    Codes with at most this many characters are treated as prefixes.
  uppercase: bool
    Whether to uppercase codes before matching.
  remove_chars: tuple[str, ...]
    Characters to remove before matching.

Returns

  out: pandas.DataFrame
    DataFrame with columns trait, n_unique_codes, unique_codes.
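
As a rough illustration of what the summary contains, the sketch below builds the same three fields (trait, n_unique_codes, unique_codes) from a flat list of observed codes in pure Python. It is a simplified stand-in for the pandas path, not the library code: normalize and summarize_matched_codes are hypothetical names, and the sketch assumes that uppercase and remove_chars are applied to both the observed codes and the trait codes before matching.

```python
def normalize(code, uppercase=True, remove_chars=()):
    """Strip unwanted characters and optionally uppercase a code."""
    for ch in remove_chars:
        code = code.replace(ch, "")
    return code.upper() if uppercase else code


def summarize_matched_codes(observed_codes, trait_map, prefix_if_len_at_most=3,
                            uppercase=True, remove_chars=()):
    """Return one row per trait listing the distinct observed codes it matched."""
    observed = {normalize(c, uppercase, remove_chars) for c in observed_codes}
    rows = []
    for trait, targets in trait_map.items():
        targets = [normalize(t, uppercase, remove_chars) for t in targets]
        matched = sorted(
            c for c in observed
            if any(
                # Short targets match as prefixes, longer ones exactly (assumption).
                c.startswith(t) if len(t) <= prefix_if_len_at_most else c == t
                for t in targets
            )
        )
        rows.append({"trait": trait, "n_unique_codes": len(matched),
                     "unique_codes": matched})
    return rows


rows = summarize_matched_codes(
    ["j45.0", "J459", "J459", "I10"],
    {"asthma": ["J45", "J46"]},
    remove_chars=(".",),
)
print(rows)
# [{'trait': 'asthma', 'n_unique_codes': 2, 'unique_codes': ['J450', 'J459']}]
```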

match_icd_traits_spark()

python
phenofhy.icd.match_icd_traits_spark(sdf, traits_and_codes, *,
	pid_col="nhse_eng_inpat.pid", diag_prefix="nhse_eng_inpat.diag_4_",
	extra_diag_cols=None, primary_only=False, prefix_if_len_at_most=3,
	uppercase=True, remove_chars=(), return_occurrence_counts=True,
	return_pids=False, max_pids_collect=200_000)

Match ICD traits in Spark with broadcasted trait codes.

Parameters

  sdf: pyspark.sql.DataFrame
    Spark dataframe with diagnosis columns.
  traits_and_codes: dict | pandas.DataFrame | tuple[list, list]
    Trait/code mapping as dict, DataFrame with trait/ICD_code, or aligned sequences.
  pid_col: str
    Participant ID column name.
  diag_prefix: str
    Prefix to select diagnosis columns.
  extra_diag_cols: list[str] | None
    Additional diagnosis columns to include when present.
  primary_only: bool
    If True, use only the primary diagnosis column.
  prefix_if_len_at_most: int | None
    Codes with at most this many characters are treated as prefixes.
  uppercase: bool
    Whether to uppercase codes before matching.
  remove_chars: tuple[str, ...]
    Characters to remove before matching.
  return_occurrence_counts: bool
    If True, include occurrence counts in the summary.
  return_pids: bool
    If True, collect pid sets, but only for traits small enough to stay within max_pids_collect.
  max_pids_collect: int
    Maximum pid set size to collect per trait.

Returns

  out: tuple[dict[str, set], pandas.DataFrame]
    Trait-to-pids mapping and a summary DataFrame.
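
One plausible reading of return_pids and max_pids_collect is sketched below in plain Python, without Spark. It is an assumption about the semantics, not the library's code: collect_pid_sets is a hypothetical helper, and the sketch assumes that traits above the cap are skipped rather than truncated.

```python
def collect_pid_sets(pids_by_trait, return_pids=False, max_pids_collect=200_000):
    """Keep pid sets only for traits at or below the size cap (assumed semantics)."""
    if not return_pids:
        # Nothing is collected to the driver unless explicitly requested.
        return {}
    collected = {}
    for trait, pids in pids_by_trait.items():
        unique_pids = set(pids)
        if len(unique_pids) <= max_pids_collect:
            collected[trait] = unique_pids
        # Larger traits are skipped here; the library may handle them differently.
    return collected


matches = {"asthma": [1, 2, 2, 3], "hypertension": [1, 2, 3, 4, 5]}
print(collect_pid_sets(matches, return_pids=True, max_pids_collect=3))
# {'asthma': {1, 2, 3}}
```

A cap of this kind keeps large pid sets from being pulled onto the driver while the summary DataFrame still reports every trait.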

get_matched_icd_traits_spark()

python
phenofhy.icd.get_matched_icd_traits_spark(sdf, traits_and_codes, *,
	pid_col="nhse_eng_inpat.pid", diag_prefix="nhse_eng_inpat.diag_4_",
	extra_diag_cols=None, primary_only=False, prefix_if_len_at_most=3,
	uppercase=True, remove_chars=())

Summarize matched ICD codes per trait (Spark path).

Parameters

  sdf: pyspark.sql.DataFrame
    Spark dataframe with diagnosis columns.
  traits_and_codes: dict | pandas.DataFrame | tuple[list, list]
    Trait/code mapping as dict, DataFrame with trait/ICD_code, or aligned sequences.
  pid_col: str
    Participant ID column name.
  diag_prefix: str
    Prefix to select diagnosis columns.
  extra_diag_cols: list[str] | None
    Additional diagnosis columns to include when present.
  primary_only: bool
    If True, use only the primary diagnosis column.
  prefix_if_len_at_most: int
    Codes with at most this many characters are treated as prefixes.
  uppercase: bool
    Whether to uppercase codes before matching.
  remove_chars: tuple[str, ...]
    Characters to remove before matching.

Returns

  out: pandas.DataFrame
    DataFrame with columns trait, n_unique_codes, unique_codes.

match_icd_traits_any()

python
phenofhy.icd.match_icd_traits_any(df_or_sdf, *args, **kwargs)

Dispatch to pandas or Spark matching depending on input type.

Parameters

  df_or_sdf: pandas.DataFrame | pyspark.sql.DataFrame
    Input pandas DataFrame or Spark DataFrame.
  *args: tuple
    Positional arguments forwarded to match_icd_traits() or match_icd_traits_spark().
  **kwargs: dict
    Keyword arguments forwarded to match_icd_traits() or match_icd_traits_spark().

Returns

  out: tuple[dict[str, set], pandas.DataFrame]
    Trait-to-pids mapping and a summary DataFrame.
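
Type-based dispatch of this kind can be sketched as follows. This is a minimal illustration rather than the library's implementation: is_spark_dataframe and dispatch are hypothetical helpers, and detecting Spark by the class's module name (so that pyspark need not be imported) is an assumed design choice.

```python
def is_spark_dataframe(obj):
    """Detect a Spark DataFrame by class name and module, without importing pyspark."""
    cls = type(obj)
    return cls.__name__ == "DataFrame" and cls.__module__.startswith("pyspark.")


def dispatch(df_or_sdf, pandas_impl, spark_impl, *args, **kwargs):
    """Forward all arguments to the implementation that fits the input type."""
    impl = spark_impl if is_spark_dataframe(df_or_sdf) else pandas_impl
    return impl(df_or_sdf, *args, **kwargs)


# Stand-in class that looks like pyspark.sql.DataFrame, for demonstration only.
FakeSparkDF = type("DataFrame", (), {"__module__": "pyspark.sql.dataframe"})

print(dispatch(FakeSparkDF(), lambda d: "pandas", lambda d: "spark"))  # spark
print(dispatch(object(), lambda d: "pandas", lambda d: "spark"))       # pandas
```

Because the two implementations accept different keyword arguments (chunksize exists only on the pandas signature, return_pids only on the Spark one), callers of a dispatcher like this need to pass only keywords that the selected path supports.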

get_matched_icd_traits_any()

python
phenofhy.icd.get_matched_icd_traits_any(df_or_sdf, *args, **kwargs)

Dispatch to pandas or Spark code-summary depending on input type.

Parameters

  df_or_sdf: pandas.DataFrame | pyspark.sql.DataFrame
    Input pandas DataFrame or Spark DataFrame.
  *args: tuple
    Positional arguments forwarded to get_matched_icd_traits() or get_matched_icd_traits_spark().
  **kwargs: dict
    Keyword arguments forwarded to get_matched_icd_traits() or get_matched_icd_traits_spark().

Returns

  out: pandas.DataFrame
    DataFrame with columns trait, n_unique_codes, unique_codes.