ICD
ICD trait matching for pandas and Spark workflows.
match_icd_traits()
phenofhy.icd.match_icd_traits(raw_df, traits_and_codes, *, diag_cols=None,
pid_col="nhse_eng_inpat.pid", prefix_if_len_at_most=3, primary_only=False,
return_occurrence_counts=True, use_pyarrow_strings=True, chunksize=None,
diag_prefix="nhse_eng_inpat.diag_4_", extra_diag_cols=None, all_if_none=False)
Match ICD traits in a pandas DataFrame.
Parameters
raw_df: pandas.DataFrame
Input dataframe with diagnosis columns.
traits_and_codes: dict | pandas.DataFrame | tuple[list, list]
Trait-to-code mapping, given as a dict, a DataFrame with trait and ICD_code columns, or a pair of aligned sequences.
diag_cols: list[str] | None
Explicit diagnosis columns to scan.
pid_col: str
Participant ID column name.
prefix_if_len_at_most: int | None
Codes at or below this length are treated as prefixes.
primary_only: bool
If True, use only the primary diagnosis column.
return_occurrence_counts: bool
If True, include occurrence counts in the summary.
use_pyarrow_strings: bool
Prefer pyarrow string dtype for faster vectorized ops.
chunksize: int | None
Optional chunk size for large dataframes.
diag_prefix: str | None
Prefix to select diagnosis columns when diag_cols is None.
extra_diag_cols: list[str] | None
Additional diagnosis columns to include when present.
all_if_none: bool
If True, use all non-pid columns when no diagnosis columns are found.
Returns
out: tuple[dict[str, set], pandas.DataFrame]
Trait-to-pids mapping and a summary DataFrame.
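The prefix rule controlled by prefix_if_len_at_most can be pictured with a small standalone sketch. This is not the library's implementation: the helper name code_matches is hypothetical, and treating None as "exact matching only" is an assumption.

```python
def code_matches(diagnosis, code, prefix_if_len_at_most=3):
    """Hypothetical helper: does a recorded diagnosis match a trait code?"""
    # Assumption: None disables prefix matching, leaving exact matches only.
    if prefix_if_len_at_most is not None and len(code) <= prefix_if_len_at_most:
        # Short codes act as prefixes, so "J45" also covers "J450", "J459", ...
        return diagnosis.startswith(code)
    # Longer codes must match the recorded diagnosis exactly.
    return diagnosis == code


print(code_matches("J459", "J45"))   # short code, compared as a prefix
print(code_matches("J459", "J450"))  # four-character code, compared exactly
```

Under this reading, a three-character category code selects every subcode beneath it, while a four-character code selects only itself.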
Example
from phenofhy import icd
trait_map = {"asthma": ["J45", "J46"]}
trait_pids, summary = icd.match_icd_traits(df, trait_map)
get_matched_icd_traits()
phenofhy.icd.get_matched_icd_traits(raw_df, traits_and_codes, *, diag_cols=None,
prefix_if_len_at_most=3, uppercase=True, remove_chars=())
Summarize matched ICD codes per trait (pandas path).
Parameters
raw_df: pandas.DataFrame
Input dataframe with diagnosis columns.
traits_and_codes: dict | pandas.DataFrame | tuple[list, list]
Trait-to-code mapping, given as a dict, a DataFrame with trait and ICD_code columns, or a pair of aligned sequences.
diag_cols: list[str] | None
Explicit diagnosis columns to scan.
prefix_if_len_at_most: int
Codes at or below this length are treated as prefixes.
uppercase: bool
Whether to uppercase codes before matching.
remove_chars: tuple[str, ...]
Characters to remove before matching.
Returns
out: pandas.DataFrame
DataFrame with columns trait, n_unique_codes, unique_codes.
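How uppercase, remove_chars, and the prefix rule could combine to produce one summary row per trait is sketched below in plain Python. The helper names normalize_code and summarize_codes are hypothetical, the order of normalization steps is an assumption, and the real function returns a pandas DataFrame rather than a list of dicts.

```python
def normalize_code(code, uppercase=True, remove_chars=()):
    """Hypothetical helper: clean a code before comparison."""
    # Assumption: characters are stripped first, then case is normalized.
    for ch in remove_chars:
        code = code.replace(ch, "")
    return code.upper() if uppercase else code


def summarize_codes(diagnoses, trait_map, prefix_if_len_at_most=3,
                    uppercase=True, remove_chars=()):
    """Hypothetical helper: one row per trait with its matched codes."""
    rows = []
    for trait, codes in trait_map.items():
        wanted = [normalize_code(c, uppercase, remove_chars) for c in codes]
        matched = set()
        for raw in diagnoses:
            diag = normalize_code(raw, uppercase, remove_chars)
            for code in wanted:
                is_prefix = len(code) <= prefix_if_len_at_most
                if (is_prefix and diag.startswith(code)) or diag == code:
                    matched.add(diag)
        rows.append({
            "trait": trait,
            "n_unique_codes": len(matched),
            "unique_codes": sorted(matched),
        })
    return rows


rows = summarize_codes(
    ["j45.0", "J459", "I10", "J450"],
    {"asthma": ["J45", "J46"]},
    remove_chars=(".",),
)
print(rows)
```

Here "j45.0" and "J450" collapse to the same normalized code, so the asthma row counts two distinct codes rather than three.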
match_icd_traits_spark()
phenofhy.icd.match_icd_traits_spark(sdf, traits_and_codes, *,
pid_col="nhse_eng_inpat.pid", diag_prefix="nhse_eng_inpat.diag_4_",
extra_diag_cols=None, primary_only=False, prefix_if_len_at_most=3,
uppercase=True, remove_chars=(), return_occurrence_counts=True,
return_pids=False, max_pids_collect=200_000)
Match ICD traits in Spark with broadcasted trait codes.
Parameters
sdf: pyspark.sql.DataFrame
Spark dataframe with diagnosis columns.
traits_and_codes: dict | pandas.DataFrame | tuple[list, list]
Trait-to-code mapping, given as a dict, a DataFrame with trait and ICD_code columns, or a pair of aligned sequences.
pid_col: str
Participant ID column name.
diag_prefix: str
Prefix to select diagnosis columns.
extra_diag_cols: list[str] | None
Additional diagnosis columns to include when present.
primary_only: bool
If True, use only the primary diagnosis column.
prefix_if_len_at_most: int | None
Codes at or below this length are treated as prefixes.
uppercase: bool
Whether to uppercase codes before matching.
remove_chars: tuple[str, ...]
Characters to remove before matching.
return_occurrence_counts: bool
If True, include occurrence counts in the summary.
return_pids: bool
If True, collect pid sets to the driver, but only for traits small enough to fit within max_pids_collect.
max_pids_collect: int
Max pid set size to collect per trait.
Returns
out: tuple[dict[str, set], pandas.DataFrame]
Trait-to-pids mapping and a summary DataFrame.
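The interaction between return_pids and max_pids_collect can be pictured as a size filter applied per trait before anything is collected. The sketch below is plain Python with no Spark dependency; the helper name collect_small_pid_sets is hypothetical, and both the inclusive limit and the omission of oversized traits from the mapping are assumptions about behaviour the reference does not spell out.

```python
def collect_small_pid_sets(pids_by_trait, max_pids_collect=200_000):
    """Hypothetical helper: keep pid sets only for traits within the limit."""
    collected = {}
    for trait, pids in pids_by_trait.items():
        unique_pids = set(pids)
        # Assumption: a trait exactly at the limit is still collected,
        # and larger traits are simply left out of the mapping.
        if len(unique_pids) <= max_pids_collect:
            collected[trait] = unique_pids
    return collected


matches = {"asthma": [1, 2, 2, 3], "hypertension": [1, 2, 3, 4, 5, 6]}
small = collect_small_pid_sets(matches, max_pids_collect=4)
print(sorted(small))
```

A cap of this kind keeps a very common trait from pulling millions of participant IDs onto the driver while still returning ID sets for rarer traits.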
get_matched_icd_traits_spark()
phenofhy.icd.get_matched_icd_traits_spark(sdf, traits_and_codes, *,
pid_col="nhse_eng_inpat.pid", diag_prefix="nhse_eng_inpat.diag_4_",
extra_diag_cols=None, primary_only=False, prefix_if_len_at_most=3,
uppercase=True, remove_chars=())
Summarize matched ICD codes per trait (Spark path).
Parameters
sdf: pyspark.sql.DataFrame
Spark dataframe with diagnosis columns.
traits_and_codes: dict | pandas.DataFrame | tuple[list, list]
Trait-to-code mapping, given as a dict, a DataFrame with trait and ICD_code columns, or a pair of aligned sequences.
pid_col: str
Participant ID column name.
diag_prefix: str
Prefix to select diagnosis columns.
extra_diag_cols: list[str] | None
Additional diagnosis columns to include when present.
primary_only: bool
If True, use only the primary diagnosis column.
prefix_if_len_at_most: int
Codes at or below this length are treated as prefixes.
uppercase: bool
Whether to uppercase codes before matching.
remove_chars: tuple[str, ...]
Characters to remove before matching.
Returns
out: pandas.DataFrame
DataFrame with columns trait, n_unique_codes, unique_codes.
match_icd_traits_any()
phenofhy.icd.match_icd_traits_any(df_or_sdf, *args, **kwargs)
Dispatch to pandas or Spark matching depending on input type.
Parameters
df_or_sdf: pandas.DataFrame | pyspark.sql.DataFrame
Input dataframe or Spark dataframe.
*args: tuple
Positional arguments forwarded to the implementation.
**kwargs: dict
Keyword arguments forwarded to the implementation.
Returns
out: tuple[dict[str, set], pandas.DataFrame]
Trait-to-pids mapping and a summary DataFrame.
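Type-based dispatch of this kind can be sketched without importing either backend. The names is_spark_dataframe and dispatch are hypothetical, and checking the type's module name is only one possible way to detect a Spark DataFrame; the library may use a different check.

```python
def is_spark_dataframe(obj):
    """Hypothetical check that avoids importing pyspark at module load."""
    return type(obj).__module__.startswith("pyspark.sql")


def dispatch(df_or_sdf, pandas_impl, spark_impl, *args, **kwargs):
    """Forward all arguments unchanged to the implementation for the input type."""
    impl = spark_impl if is_spark_dataframe(df_or_sdf) else pandas_impl
    return impl(df_or_sdf, *args, **kwargs)


class FakeSparkFrame:
    # Stand-in for pyspark.sql.DataFrame so the sketch runs without Spark.
    __module__ = "pyspark.sql.dataframe"


def pandas_path(df, *args, **kwargs):
    return "pandas"


def spark_path(sdf, *args, **kwargs):
    return "spark"


print(dispatch(FakeSparkFrame(), pandas_path, spark_path))
print(dispatch(object(), pandas_path, spark_path))
```

Because positional and keyword arguments are forwarded as-is, any keyword accepted by only one backend (for example chunksize on the pandas path or return_pids on the Spark path) should be passed only when the input type matches.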
get_matched_icd_traits_any()
phenofhy.icd.get_matched_icd_traits_any(df_or_sdf, *args, **kwargs)
Dispatch to pandas or Spark code-summary depending on input type.
Parameters
df_or_sdf: pandas.DataFrame | pyspark.sql.DataFrame
Input dataframe or Spark dataframe.
*args: tuple
Positional arguments forwarded to the implementation.
**kwargs: dict
Keyword arguments forwarded to the implementation.
Returns
out: pandas.DataFrame
DataFrame with columns trait, n_unique_codes, unique_codes.