Extract
Extraction helpers for DNAnexus datasets, including SQL export.
fields()
phenofhy.extract.fields(*fields, output_file=None, sep=",", sql_only=False,
cohort_record_id=None, sql_file=None, max_rows=None, sanitize=True,
replace_spaces=True, lower=True, return_sql=False)Extract raw phenotype values to CSV, or generate SQL if sql_only is True.
Parameters
*fields: str
One or more fully-qualified field names (e.g., "participant.birth_year").
output_file: str | None
Output CSV path. If None and sql_only is False, returns a DataFrame.
sep: str
CSV delimiter for output_file.
sql_only: bool
If True, only generate SQL and do not execute.
cohort_record_id: str | None
Optional cohort record ID to target a cohort dataset.
sql_file: str | None
Optional file path to write SQL when sql_only is True or return_sql is True.
max_rows: int | None
Optional row limit for extraction.
sanitize: bool
Whether to sanitize column names and values.
replace_spaces: bool
Replace spaces in column names when sanitize is True.
lower: bool
Lowercase column names when sanitize is True.
return_sql: bool
If True, return SQL string alongside extracted data.
Returns
out: pandas.DataFrame | str | tuple
DataFrame when executed, SQL string when sql_only, or a tuple when return_sql is True.
Example
from phenofhy import extract
extract.fields(
output_file="outputs/raw/phenos.csv",
fields=["participant.birth_year", "participant.birth_month"],
)run_extraction()
phenofhy.extract.run_extraction(fields, *, output_file=None, sep=",", sql_only=False,
cohort_record_id=None, sql_file=None, max_rows=None, sanitize=True,
replace_spaces=True, lower=True, return_sql=False)Lower-level extraction entry point used by fields().
Parameters
fields: list[str]
List of fully-qualified field names.
output_file: str | None
Output CSV path. If None and sql_only is False, returns a DataFrame.
sep: str
CSV delimiter for output_file.
sql_only: bool
If True, only generate SQL and do not execute.
cohort_record_id: str | None
Optional cohort record ID to target a cohort dataset.
sql_file: str | None
Optional file path to write SQL when sql_only is True or return_sql is True.
max_rows: int | None
Optional row limit for extraction.
sanitize: bool
Whether to sanitize column names and values.
replace_spaces: bool
Replace spaces in column names when sanitize is True.
lower: bool
Lowercase column names when sanitize is True.
return_sql: bool
If True, return SQL string alongside extracted data.
Returns
out: pandas.DataFrame | str | tuple
DataFrame when executed, SQL string when sql_only, or a tuple when return_sql is True.
sql_to_pandas()
phenofhy.extract.sql_to_pandas(sql, *, cohort_record_id=None,
sanitize=True, replace_spaces=True, lower=True)Execute SQL and return a pandas DataFrame.
Parameters
sql: str
SQL query string. cohort_record_id: str | None
Optional cohort record ID to target a cohort dataset.
sanitize: bool
Whether to sanitize column names and values.
replace_spaces: bool
Replace spaces in column names when sanitize is True.
lower: bool
Lowercase column names when sanitize is True.
Returns
out: pandas.DataFrame
Query results as a DataFrame.
sql_to_spark()
phenofhy.extract.sql_to_spark(sql, *, cohort_record_id=None,
sanitize=True, replace_spaces=True, lower=True)Execute SQL and return a Spark DataFrame.
Parameters
sql: str
SQL query string. cohort_record_id: str | None
Optional cohort record ID to target a cohort dataset.
sanitize: bool
Whether to sanitize column names and values.
replace_spaces: bool
Replace spaces in column names when sanitize is True.
lower: bool
Lowercase column names when sanitize is True.
Returns
out: pyspark.sql.DataFrame
Query results as a Spark DataFrame.