Skip to content

Extract

Extraction helpers for DNAnexus datasets, including SQL export.

fields()

python
phenofhy.extract.fields(*fields, output_file=None, sep=",", sql_only=False,
	cohort_record_id=None, sql_file=None, max_rows=None, sanitize=True,
	replace_spaces=True, lower=True, return_sql=False)

Extract raw phenotype values to CSV, or generate SQL if sql_only is True.

Parameters

  *fields: str
    One or more fully-qualified field names (e.g., "participant.birth_year").
  output_file: str | None
    Output CSV path. If None and sql_only is False, returns a DataFrame.
  sep: str
    CSV delimiter for output_file.
  sql_only: bool
    If True, only generate SQL and do not execute.
  cohort_record_id: str | None
    Optional cohort record ID to target a cohort dataset.
  sql_file: str | None
    Optional file path to write SQL when sql_only is True or return_sql is True.
  max_rows: int | None
    Optional row limit for extraction.
  sanitize: bool
    Whether to sanitize column names and values.
  replace_spaces: bool
    Replace spaces in column names when sanitize is True.
  lower: bool
    Lowercase column names when sanitize is True.
  return_sql: bool
    If True, return SQL string alongside extracted data.

Returns

  out: pandas.DataFrame | str | tuple
    DataFrame when executed, SQL string when sql_only, or a tuple when return_sql is True.

Example

python
from phenofhy import extract

extract.fields(
    output_file="outputs/raw/phenos.csv",
    fields=["participant.birth_year", "participant.birth_month"],
)

run_extraction()

python
phenofhy.extract.run_extraction(fields, *, output_file=None, sep=",", sql_only=False,
	cohort_record_id=None, sql_file=None, max_rows=None, sanitize=True,
	replace_spaces=True, lower=True, return_sql=False)

Lower-level extraction entry point used by fields().

Parameters

  fields: list[str]
    List of fully-qualified field names.
  output_file: str | None
    Output CSV path. If None and sql_only is False, returns a DataFrame.
  sep: str
    CSV delimiter for output_file.
  sql_only: bool
    If True, only generate SQL and do not execute.
  cohort_record_id: str | None
    Optional cohort record ID to target a cohort dataset.
  sql_file: str | None
    Optional file path to write SQL when sql_only is True or return_sql is True.
  max_rows: int | None
    Optional row limit for extraction.
  sanitize: bool
    Whether to sanitize column names and values.
  replace_spaces: bool
    Replace spaces in column names when sanitize is True.
  lower: bool
    Lowercase column names when sanitize is True.
  return_sql: bool
    If True, return SQL string alongside extracted data.

Returns

  out: pandas.DataFrame | str | tuple
    DataFrame when executed, SQL string when sql_only, or a tuple when return_sql is True.

sql_to_pandas()

python
phenofhy.extract.sql_to_pandas(sql, *, cohort_record_id=None,
	sanitize=True, replace_spaces=True, lower=True)

Execute SQL and return a pandas DataFrame.

Parameters

  sql: str
    SQL query string.   cohort_record_id: str | None
    Optional cohort record ID to target a cohort dataset.
  sanitize: bool
    Whether to sanitize column names and values.
  replace_spaces: bool
    Replace spaces in column names when sanitize is True.
  lower: bool
    Lowercase column names when sanitize is True.

Returns

  out: pandas.DataFrame
    Query results as a DataFrame.

sql_to_spark()

python
phenofhy.extract.sql_to_spark(sql, *, cohort_record_id=None,
	sanitize=True, replace_spaces=True, lower=True)

Execute SQL and return a Spark DataFrame.

Parameters

  sql: str
    SQL query string.   cohort_record_id: str | None
    Optional cohort record ID to target a cohort dataset.
  sanitize: bool
    Whether to sanitize column names and values.
  replace_spaces: bool
    Replace spaces in column names when sanitize is True.
  lower: bool
    Lowercase column names when sanitize is True.

Returns

  out: pyspark.sql.DataFrame
    Query results as a Spark DataFrame.