| Title: | 'BCFTools', 'libbcftools' and 'htslib' Wrappers and 'BCF'/'VCF' to 'Parquet' Convertors |
|---|---|
| Description: | Bundles the 'htslib' and 'bcftools' libraries and command-line tools for reading and manipulating VCF/BCF files. Includes streaming facilities from VCF to Apache Arrow via 'nanoarrow', enabling export to Arrow IPC format and Parquet format using 'duckdb', including a 'bcf_reader' extension. Utilities for reading and writing VCF/BCF files into 'DuckLake' are included. |
| Authors: | Sounkou Mahamane Toure [aut, cre], Bonfield, James K and Marshall, John and Danecek, Petr and Li, Heng and Ohan, Valeriu and Whitwham, Andrew and Keane, Thomas Davies, Robert M, Pierre Lindenbaum [cph] (Authors of included htslib library and bcftools command line tools), Zilong Li [cph] (Author of the vcfpp library from whom makefiles and configure strategy is borrowed), Duckdb C API and extension and API authors [cph] (Authors of the duckdb extension and API used for parquet export), Giulio Genovese [cph] (Author of BCFTools munge plugin) |
| Maintainer: | Sounkou Mahamane Toure <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 1.23-0.0.3.1.9001 |
| Built: | 2026-06-08 11:11:45 UTC |
| Source: | https://github.com/RGenomicsETL/RBCFTools |
Returns the path to the bundled annot-tsv executable.
annot_tsv_path()
A character string containing the path to the annot-tsv executable.
annot_tsv_path()
Compiles the bcf_reader extension from source using the package's htslib. Source files are copied to the build directory first.
bcf_reader_build(build_dir, force = FALSE, verbose = TRUE)
build_dir |
Directory where to build the extension. Source files will
be copied here and the extension will be built in |
force |
Logical, force rebuild even if extension exists |
verbose |
Logical, show build output |
Path to the built extension file
    ## Not run:
    # Build in temp directory
    ext_path <- bcf_reader_build(tempdir())
    # Build in a specific location
    ext_path <- bcf_reader_build("/tmp/bcf_reader")
    # Force rebuild
    ext_path <- bcf_reader_build("/tmp/bcf_reader", force = TRUE)
    ## End(Not run)
Copies the extension source files from the package to a specified directory for building. This is necessary because the installed package directory is typically read-only.
bcf_reader_copy_source(dest_dir)
dest_dir |
Directory where to copy the source files. |
Invisible path to the destination directory
    ## Not run:
    # Copy to temp directory
    build_dir <- bcf_reader_copy_source(tempdir())
    # Copy to a specific location
    build_dir <- bcf_reader_copy_source("/tmp/bcf_reader_build")
    ## End(Not run)
Returns the path to the directory containing bcftools and related scripts.
bcftools_bin_dir()
The directory contains the following tools:
bcftools - Main bcftools executable
color-chrs.pl - Chromosome coloring script
gff2gff - GFF conversion tool
gff2gff.py - GFF conversion Python script
guess-ploidy.py - Ploidy guessing script
plot-roh.py - ROH plotting script
plot-vcfstats - VCF statistics plotting script
roh-viz - ROH visualization tool
run-roh.pl - ROH analysis script
vcfutils.pl - VCF utilities script
vrfs-variances - Variant frequency variances tool
A character string containing the path to the bcftools bin directory.
bcftools_bin_dir()
Returns the path to the bcftools library files for use in linking.
bcftools_lib_dir()
This directory contains libbcftools.a (static) and libbcftools.so
(shared) libraries.
A character string containing the path to the bcftools lib directory.
bcftools_lib_dir()
Returns the linker flags needed to link against the bcftools library.
bcftools_libs()
Note that the bcftools library also depends on htslib, so you typically need
to include both bcftools_libs() and htslib_libs() in your linker flags.
A character string containing linker flags including the -L library
path and the -l library name.
    bcftools_libs()
    # Full linking:
    paste(RBCFTools::bcftools_libs(), RBCFTools::htslib_libs())
Returns the path to the bundled bcftools executable.
bcftools_path()
A character string containing the path to the bcftools executable.
bcftools_path()
Returns the path to the directory containing bcftools plugins.
bcftools_plugins_dir()
A character string containing the path to the bcftools plugins directory.
bcftools_plugins_dir()
Lists all available scripts and tools in the bcftools bin directory.
bcftools_tools()
A character vector of available tool names.
bcftools_tools()
Returns the version string of the bundled bcftools library.
bcftools_version()
A character string containing the bcftools version.
bcftools_version()
Returns the path to the bundled bgzip executable.
bgzip_path()
A character string containing the path to the bgzip executable.
bgzip_path()
Utilities to load the DuckLake extension, attach a lake (local or S3-backed), configure S3 secrets, and write variants using either direct DuckDB insert or parallel Parquet conversion.
Attach a DuckLake catalog (legacy function)
ducklake_attach(con, metadata_path, data_path, alias = "ducklake", read_only = FALSE, create_if_missing = TRUE, extra_options = list())
con |
A DuckDB connection with DuckLake loaded. |
metadata_path |
Path/URI to the DuckLake metadata DB (without the |
data_path |
Path/URI for table data (Parquet files). |
alias |
Schema alias to attach as. Default: "ducklake". |
read_only |
Logical, open lake read-only. Default: FALSE. |
create_if_missing |
Logical, create metadata DB if missing. Default: TRUE. |
extra_options |
Named list of additional ATTACH options (e.g., list(METADATA_CATALOG = "meta")). |
Invisible NULL.
See ducklake_connect_catalog for abstracted backend support.
Connect to a DuckLake catalog with abstracted backend support
    ducklake_connect_catalog(
      con,
      backend = c("duckdb", "sqlite", "postgresql", "mysql"),
      connection_string = NULL,
      data_path = NULL,
      alias = "ducklake",
      secret_name = NULL,
      read_only = FALSE,
      create_if_missing = TRUE,
      extra_options = list()
    )
con |
A DuckDB connection with DuckLake loaded. |
backend |
Database backend type ("duckdb", "sqlite", "postgresql", "mysql"). |
connection_string |
Database connection string (format depends on backend). |
data_path |
Path/URI for table data (Parquet files). Required for new lakes. |
alias |
Schema alias to attach as. Default: "ducklake". |
secret_name |
Optional secret name to use instead of direct connection parameters. |
read_only |
Logical, open lake read-only. Default: FALSE. |
create_if_missing |
Logical, create metadata DB if missing. Default: TRUE. |
extra_options |
Named list of additional ATTACH options. |
Invisible NULL.
Create a DuckLake catalog secret for database credentials
    ducklake_create_catalog_secret(
      con,
      name = "ducklake_catalog",
      backend = c("duckdb", "sqlite", "postgresql", "mysql"),
      connection_string,
      data_path = NULL,
      metadata_parameters = list(),
      persistent = FALSE
    )
con |
A DuckDB connection. |
name |
Secret name (identifier). Default: "ducklake_catalog". |
backend |
Database backend type ("duckdb", "sqlite", "postgresql", "mysql"). |
connection_string |
Database connection string (without ducklake: prefix). |
data_path |
Default data path for this catalog. Optional. |
metadata_parameters |
Named list of additional metadata parameters. |
persistent |
Logical, create a persistent secret. Default: FALSE. |
Invisible NULL.
Create or replace an S3 secret for DuckLake
    ducklake_create_s3_secret(
      con,
      name = "ducklake_s3",
      key_id,
      secret,
      endpoint = NULL,
      region = NULL,
      use_ssl = TRUE,
      url_style = "path",
      session_token = NULL
    )
con |
A DuckDB connection. |
name |
Secret name (identifier). Default: "ducklake_s3". |
key_id |
S3 key ID. |
secret |
S3 secret key. |
endpoint |
Optional S3-compatible endpoint (e.g., "s3.us-east-1.amazonaws.com" or "minio:9000"). |
region |
Optional region. |
use_ssl |
Logical, whether to use SSL. Default: TRUE. |
url_style |
URL style ("path" or "virtual_host"). Default: "path". |
session_token |
Optional session token. |
Invisible NULL.
Get current snapshot ID
ducklake_current_snapshot(con, catalog = "lake")
con |
DuckDB connection with DuckLake attached. |
catalog |
DuckLake catalog name. |
Integer snapshot ID.
Download a static MinIO client (mc) binary
ducklake_download_mc(dest_dir = tempdir(), url = NULL, filename = "mc")
dest_dir |
Destination directory (created if missing). |
url |
Optional download URL. Defaults to mc Linux build for host arch. |
filename |
Output filename. Defaults to "mc". |
Path to downloaded binary.
Download a static MinIO server binary
ducklake_download_minio(dest_dir = tempdir(), url = NULL, filename = "minio")
dest_dir |
Destination directory (created if missing). |
url |
Optional download URL. Defaults to MinIO Linux build for host arch. |
filename |
Output filename. Defaults to "minio". |
Path to downloaded binary.
Drop a DuckLake catalog secret
ducklake_drop_secret(con, name)
con |
A DuckDB connection. |
name |
Secret name to drop. |
Invisible NULL.
List files managed by DuckLake for a table
ducklake_list_files(con, catalog = "lake", table, schema = "main")
con |
DuckDB connection with DuckLake attached. |
catalog |
DuckLake catalog name. |
table |
Table name. |
schema |
Schema name (default "main"). |
Data frame with file information.
List existing DuckLake catalog secrets
ducklake_list_secrets(con)
con |
A DuckDB connection. |
Data frame with columns: name, type, metadata_path, data_path.
Load the DuckLake extension
ducklake_load(con, install = TRUE)
con |
A DuckDB connection. |
install |
Logical, attempt |
The connection (invisibly).
Converts VCF/BCF to Parquet using the fast bcf_reader extension, then
registers the Parquet file in a DuckLake catalog table.
    ducklake_load_vcf(
      con,
      table,
      vcf_path,
      extension_path,
      output_path = NULL,
      threads = parallel::detectCores(),
      compression = "zstd",
      row_group_size = 100000L,
      region = NULL,
      columns = NULL,
      overwrite = FALSE,
      allow_evolution = FALSE,
      tidy_format = FALSE,
      partition_by = NULL
    )
con |
DuckDB connection with DuckLake attached. |
table |
Target table name (optionally qualified, e.g., "lake.variants"). |
vcf_path |
Path/URI to VCF/BCF file. |
extension_path |
Path to bcf_reader.duckdb_extension (required). |
output_path |
Optional Parquet output path. If NULL, uses DuckLake's DATA_PATH. |
threads |
Number of threads for conversion. |
compression |
Parquet compression codec. |
row_group_size |
Parquet row group size. |
region |
Optional region filter (e.g., "chr1:1000-2000"). |
columns |
Optional character vector of columns to include. |
overwrite |
Logical, drop existing table first. |
allow_evolution |
Logical, evolve table schema by adding new columns from VCF. Default: FALSE. When TRUE, new columns found in the VCF are added via ALTER TABLE before insertion, making all columns queryable. Useful for combining VCFs with different annotations (e.g., VEP columns) or different samples (FORMAT_*_SampleName). |
tidy_format |
Logical, if TRUE exports data in tidy (long) format with one row per variant-sample combination and a SAMPLE_ID column. Default FALSE. Ideal for cohort analysis and combining multiple single-sample VCFs. |
partition_by |
Optional character vector of columns to partition by (Hive-style).
Creates directory structure like |
This is the recommended function for loading VCF data into DuckLake.
It uses the bcf_reader DuckDB extension for fast VCF→Parquet conversion,
which is significantly faster than the nanoarrow streaming path.
Workflow:
VCF → Parquet via vcf_to_parquet_duckdb() (bcf_reader)
Register Parquet in DuckLake catalog
Schema Evolution (allow_evolution = TRUE):
When loading multiple VCFs with different schemas (e.g., different samples
or different annotation fields), enable allow_evolution to automatically
add new columns to the table schema. This uses DuckLake's ALTER TABLE ADD COLUMN
which preserves existing data files without rewriting.
Tidy Format (tidy_format = TRUE):
When building cohort tables from multiple single-sample VCFs, use tidy_format = TRUE
to get one row per variant-sample combination with a SAMPLE_ID column. This format
is ideal for downstream analysis and MERGE/UPSERT operations on DuckLake tables.
Partitioning (partition_by):
When using partition_by, the output is a Hive-partitioned directory structure.
This is useful for large cohorts where you want efficient per-sample queries.
DuckDB auto-generates Bloom filters for VARCHAR columns like SAMPLE_ID.
Note: For DuckLake, partitioned output requires manual file registration.
Invisibly returns the path to the created Parquet file.
    ## Not run:
    # Build extension
    ext_path <- bcf_reader_build(tempdir())
    # Setup DuckLake
    con <- duckdb::dbConnect(duckdb::duckdb())
    ducklake_load(con)
    ducklake_attach(con, "catalog.ducklake", "/data/parquet/", alias = "lake")
    DBI::dbExecute(con, "USE lake")
    # Load first VCF
    ducklake_load_vcf(con, "variants", "sample1.vcf.gz", ext_path, threads = 8)
    # Load second VCF with different annotations, evolving schema
    ducklake_load_vcf(con, "variants", "sample2_vep.vcf.gz", ext_path,
      allow_evolution = TRUE
    )
    # Load VCF in tidy format (one row per variant-sample)
    ducklake_load_vcf(con, "variants_tidy", "cohort.vcf.gz", ext_path,
      tidy_format = TRUE
    )
    # Query - all columns from both VCFs are available
    DBI::dbGetQuery(con, "SELECT CHROM, COUNT(*) FROM variants GROUP BY CHROM")
    ## End(Not run)
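The tidy layout described for tidy_format = TRUE can be pictured with a small base R sketch. This is not how the package performs the conversion (that happens inside the bcf_reader extension); the toy data frame, the FORMAT_GT_<sample> column names, the output column FORMAT_GT, and the helper to_tidy() are all assumptions made for illustration.

```r
# Toy wide table: one row per variant, one FORMAT_GT_<sample> column per sample.
# Column names and values are made up for illustration.
wide <- data.frame(
  CHROM = c("chr1", "chr1"),
  POS = c(100L, 200L),
  FORMAT_GT_S1 = c("0/1", "1/1"),
  FORMAT_GT_S2 = c("0/0", "0/1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: stack the per-sample columns into one row per
# variant-sample pair and record the sample name in SAMPLE_ID.
to_tidy <- function(df, prefix = "FORMAT_GT_") {
  sample_cols <- grep(paste0("^", prefix), names(df), value = TRUE)
  site_cols <- setdiff(names(df), sample_cols)
  pieces <- lapply(sample_cols, function(col) {
    out <- df[, site_cols, drop = FALSE]
    out$SAMPLE_ID <- sub(prefix, "", col, fixed = TRUE)
    out$FORMAT_GT <- df[[col]]
    out
  })
  do.call(rbind, pieces)
}

tidy <- to_tidy(wide)
print(tidy)
```

With two variants and two samples the long table has four rows, which is also why a tidy row count equals variants times samples.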
Merge/upsert data into a DuckLake table
ducklake_merge(con, target, source, on_cols, when_matched = "UPDATE", when_not_matched = "INSERT", update_cols = NULL)
con |
DuckDB connection with DuckLake attached. |
target |
Target table name. |
source |
Source table/query. |
on_cols |
Column(s) to match on. |
when_matched |
Action when matched: "UPDATE", "DELETE", or NULL. |
when_not_matched |
Action when not matched: "INSERT" or NULL. |
update_cols |
Columns to update (NULL = all columns). |
Number of rows affected.
Get DuckLake configuration options
ducklake_options(con, catalog = "lake")
con |
DuckDB connection with DuckLake attached. |
catalog |
DuckLake catalog name. |
Data frame with current options.
Parse DuckLake connection string into components
ducklake_parse_connection_string(connection_string)
connection_string |
DuckLake connection string (e.g., "ducklake:path/to/catalog.ducklake"). |
Named list with components: backend, metadata_path, data_path (if specified).
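As a rough illustration of what splitting such a string involves, the base R sketch below strips the ducklake: prefix and guesses the backend from a second prefix. The helper sketch_parse() and the sqlite:, postgres:, and mysql: prefixes are assumptions for illustration; the package's own parsing rules, including how data_path is detected, may differ.

```r
# Hypothetical helper, not the package's parser.
sketch_parse <- function(connection_string) {
  # Drop the leading "ducklake:" marker if present.
  rest <- sub("^ducklake:", "", connection_string)
  # Assumed backend prefixes; anything else is treated as a DuckDB file.
  prefixes <- c(sqlite = "sqlite:", postgresql = "postgres:", mysql = "mysql:")
  hit <- names(prefixes)[startsWith(rest, prefixes)]
  if (length(hit) == 1L) {
    list(
      backend = hit,
      metadata_path = substring(rest, nchar(prefixes[[hit]]) + 1L)
    )
  } else {
    list(backend = "duckdb", metadata_path = rest)
  }
}

str(sketch_parse("ducklake:path/to/catalog.ducklake"))
str(sketch_parse("ducklake:sqlite:meta.sqlite"))
```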
Query table at a specific snapshot (time travel)
ducklake_query_snapshot(con, table, snapshot_id, query = "SELECT * FROM tbl")
con |
DuckDB connection with DuckLake attached. |
table |
Table name. |
snapshot_id |
Snapshot version to query. |
query |
SQL query (use 'tbl' as table alias). |
Query result as data frame.
Adds Parquet files that already exist (from prior ETL) to a DuckLake table. This is a catalog-only operation; data files are not copied or moved.
    ducklake_register_parquet(
      con,
      table,
      parquet_files,
      create_table = TRUE,
      allow_missing = FALSE,
      ignore_extra_columns = FALSE,
      allow_evolution = FALSE
    )
con |
DuckDB connection with DuckLake attached. |
table |
Target table name (optionally qualified, e.g., "lake.variants"). |
parquet_files |
Character vector of Parquet file paths/URIs. |
create_table |
Logical, create the table if it doesn't exist. Default: TRUE. When TRUE, schema is inferred from the first Parquet file. |
allow_missing |
Logical, allow missing columns (filled with defaults). Default: FALSE. |
ignore_extra_columns |
Logical, ignore extra columns in files. Default: FALSE. |
allow_evolution |
Logical, evolve table schema by adding new columns from files. Default: FALSE. When TRUE, new columns found in files are added via ALTER TABLE before registration, making all columns queryable. |
This function uses DuckLake's ducklake_add_data_files() to register
external Parquet files in the catalog. The files must already exist and
have a schema compatible with the target table.
Schema Evolution (allow_evolution = TRUE):
When enabled, the function compares each file's schema against the table schema
and adds any missing columns via ALTER TABLE ADD COLUMN before registration.
This allows combining VCF files with different annotations (e.g., VEP columns)
into a single table where all columns are queryable.
Invisibly returns the number of files registered.
    ## Not run:
    # Register a Parquet file created by vcf_to_parquet_duckdb()
    ducklake_register_parquet(con, "variants", "s3://bucket/variants.parquet")
    # Register with schema evolution (add new columns from file)
    ducklake_register_parquet(con, "variants", "s3://bucket/vep_variants.parquet",
      allow_evolution = TRUE
    )
    ## End(Not run)
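The schema comparison described for allow_evolution = TRUE can be sketched in base R as a set difference followed by one ALTER TABLE statement per new column. The schemas, the column INFO_CSQ, and the helper evolution_sql() are hypothetical; this only illustrates the idea and is not the package's code.

```r
# Hypothetical schemas as named character vectors (column name -> SQL type).
table_schema <- c(CHROM = "VARCHAR", POS = "BIGINT", REF = "VARCHAR")
file_schema <- c(CHROM = "VARCHAR", POS = "BIGINT", REF = "VARCHAR",
                 INFO_CSQ = "VARCHAR")

# Hypothetical helper: one ALTER TABLE statement per column that is present
# in the file but missing from the table.
evolution_sql <- function(table, table_schema, file_schema) {
  new_cols <- setdiff(names(file_schema), names(table_schema))
  if (length(new_cols) == 0L) return(character(0))
  sprintf('ALTER TABLE %s ADD COLUMN "%s" %s',
          table, new_cols, file_schema[new_cols])
}

stmts <- evolution_sql("variants", table_schema, file_schema)
print(stmts)
```

When the file adds nothing new the helper returns an empty vector, so there is nothing to execute before registration.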
Must be called within a transaction (BEGIN/COMMIT block).
ducklake_set_commit_message(con, catalog = "lake", author, message, extra_info = NULL)
con |
DuckDB connection with DuckLake attached. |
catalog |
DuckLake catalog name. |
author |
Author name. |
message |
Commit message. |
extra_info |
Optional JSON string with extra metadata. |
Invisible NULL.
Set DuckLake configuration option
ducklake_set_option(con, catalog = "lake", option, value, schema = NULL, table_name = NULL)
con |
DuckDB connection with DuckLake attached. |
catalog |
DuckLake catalog name. |
option |
Option name (e.g., "parquet_compression", "parquet_row_group_size"). |
value |
Option value. |
schema |
Optional schema scope. |
table_name |
Optional table scope. |
Common options:
parquet_compression: snappy, zstd, gzip, lz4
parquet_row_group_size: rows per row group (default 122880)
target_file_size: target file size for compaction (default 512MB)
data_inlining_row_limit: max rows to inline (default 0)
Invisible NULL.
List DuckLake snapshots
ducklake_snapshots(con, catalog = "lake")
con |
DuckDB connection with DuckLake attached. |
catalog |
DuckLake catalog name (alias used in ATTACH). |
Data frame with snapshot history.
Update an existing DuckLake catalog secret
ducklake_update_secret(con, name, connection_string, data_path = NULL, metadata_parameters = list())
con |
A DuckDB connection. |
name |
Secret name to update. |
connection_string |
New database connection string. |
data_path |
New default data path. Optional. |
metadata_parameters |
New named list of metadata parameters. |
Invisible NULL.
Returns the path to the bundled htsfile executable for identifying file formats.
htsfile_path()
A character string containing the path to the htsfile executable.
htsfile_path()
Returns the path to the directory containing htslib executables.
htslib_bin_dir()
The directory contains the following tools:
annot-tsv - Annotate TSV files
bgzip - Block gzip compression
htsfile - Identify file format
ref-cache - Reference sequence cache management
tabix - Index and query TAB-delimited files
A character string containing the path to the htslib bin directory.
htslib_bin_dir()
Returns a named list of all capabilities of the bundled htslib library.
htslib_capabilities()
A named list with logical values for each capability:
Whether ./configure was used to build.
Whether plugins are enabled.
Whether libcurl support is enabled.
Whether S3 support is enabled.
Whether Google Cloud Storage support is enabled.
Whether libdeflate compression is enabled.
Whether LZMA compression is enabled.
Whether bzip2 compression is enabled.
Whether htscodecs library is available.
    caps <- htslib_capabilities()
    caps$libcurl
    caps$s3
Returns the compiler flags (CFLAGS/CPPFLAGS) needed to compile code that uses htslib.
htslib_cflags()
A character string containing compiler flags including the -I
include path.
    htslib_cflags()
    # Use in Makevars:
    # PKG_CPPFLAGS = $(shell Rscript -e "cat(RBCFTools::htslib_cflags())")
Returns a human-readable string describing the enabled features in htslib.
htslib_feature_string()
A character string describing the enabled features.
htslib_feature_string()
Returns the raw bitfield of enabled features in htslib.
htslib_features()
An integer representing the feature bitfield.
htslib_features()
Checks if a specific feature is enabled in the bundled htslib library.
    htslib_has_feature(feature_id)
    HTS_FEATURE_CONFIGURE
    HTS_FEATURE_PLUGINS
    HTS_FEATURE_LIBCURL
    HTS_FEATURE_S3
    HTS_FEATURE_GCS
    HTS_FEATURE_LIBDEFLATE
    HTS_FEATURE_LZMA
    HTS_FEATURE_BZIP2
    HTS_FEATURE_HTSCODECS
feature_id |
An integer feature ID. Use one of the |
Each of the nine HTS_FEATURE_* constants is an object of class integer of length 1.
A logical value indicating if the feature is enabled.
    # Check for libcurl support (feature ID 1024)
    htslib_has_feature(1024L)
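Because the features are reported as a bitfield, checking one of them amounts to testing a single bit. The base R sketch below assumes that this is what htslib_has_feature() does with the value returned by htslib_features(); the bitfield value and the helper feature_enabled() are made up for illustration, and only the ID 1024 for libcurl comes from the example above.

```r
# Hypothetical bitfield, standing in for the result of htslib_features().
features <- bitwOr(1024L, 2L)

# A feature counts as enabled when its bit is set in the bitfield.
feature_enabled <- function(features, feature_id) {
  bitwAnd(features, feature_id) != 0L
}

feature_enabled(features, 1024L)  # libcurl bit, set in this toy value
feature_enabled(features, 4096L)  # another bit, not set in this toy value
```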
Returns the path to the htslib header files for use in compilation.
htslib_include_dir()
This directory contains the htslib headers (e.g., htslib/hts.h,
htslib/vcf.h, etc.). Use this path with -I compiler flag when
compiling code that uses htslib.
A character string containing the path to the htslib include directory.
htslib_include_dir()
Returns the path to the htslib library files for use in linking.
htslib_lib_dir()
This directory contains libhts.a (static) and libhts.so (shared)
libraries. Use this path with -L linker flag when linking against htslib.
A character string containing the path to the htslib lib directory.
htslib_lib_dir()
Returns the linker flags needed to link against htslib.
htslib_libs(static = FALSE)
static |
Logical. If |
For dynamic linking, returns -L<libdir> -lhts.
For static linking, also includes the dependent libraries:
-lpthread -lz -lm -lbz2 -llzma -ldeflate.
A character string containing linker flags including -L library
path and -l library names.
    htslib_libs()
    htslib_libs(static = TRUE)
    # Use in Makevars:
    # PKG_LIBS = $(shell Rscript -e "cat(RBCFTools::htslib_libs())")
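The difference between the dynamic and static flag sets described above can be sketched in base R. The helper sketch_htslib_libs() and the path /opt/htslib/lib are hypothetical stand-ins for htslib_libs() and htslib_lib_dir(); the flag names are the ones listed in the description.

```r
# Hypothetical helper mirroring the behaviour described for htslib_libs().
# `lib_dir` stands in for the value of htslib_lib_dir().
sketch_htslib_libs <- function(lib_dir, static = FALSE) {
  flags <- c(paste0("-L", lib_dir), "-lhts")
  if (static) {
    # A static archive does not carry its own dependency list, so the
    # dependent libraries are named explicitly after -lhts.
    flags <- c(flags, "-lpthread", "-lz", "-lm", "-lbz2", "-llzma", "-ldeflate")
  }
  paste(flags, collapse = " ")
}

sketch_htslib_libs("/opt/htslib/lib")
sketch_htslib_libs("/opt/htslib/lib", static = TRUE)
```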
Returns the path to the directory containing htslib plugins (e.g., for remote file access via libcurl, S3, GCS).
htslib_plugins_dir()
A character string containing the path to the htslib plugins directory.
htslib_plugins_dir()
Lists all available tools in the htslib bin directory.
htslib_tools()
A character vector of available tool names.
htslib_tools()
Returns the version string of the bundled htslib library.
htslib_version()
A character string containing the htslib version.
htslib_version()
Returns a list with all paths and flags needed for linking against htslib and bcftools from this package.
linking_info()
A named list with the following elements:
Path to htslib include directory
Path to htslib library directory
Path to bcftools library directory
Compiler flags for htslib
Linker flags for htslib (dynamic)
Linker flags for htslib (static)
Linker flags for bcftools
Combined linker flags for both bcftools and htslib
    info <- linking_info()
    info$cflags
    info$all_libs
Reads the custom key-value metadata stored in a Parquet file's footer.
This includes the full VCF header if the file was created with
vcf_to_parquet_duckdb with include_metadata = TRUE.
parquet_kv_metadata(file, con = NULL)
file |
Path to Parquet file |
con |
Optional existing DuckDB connection |
A data frame with columns: key, value. Returns empty data frame if no custom metadata exists.
    ## Not run:
    meta <- parquet_kv_metadata("variants.parquet")
    # Get the VCF header
    vcf_header <- meta[meta$key == "vcf_header", "value"]
    cat(vcf_header)
    ## End(Not run)
Reconstruct a VCF file from Parquet data created by vcf_to_parquet_duckdb.
Uses the VCF header stored in Parquet metadata for proper formatting.
parquet_to_vcf(input_file, output_file, header = NULL, index = TRUE, con = NULL)
input_file |
Path to input Parquet file (must have VCF metadata) |
output_file |
Path to output VCF/VCF.GZ/BCF file. Format determined by extension. |
header |
Optional VCF header string. If NULL (default), reads from Parquet metadata. |
index |
Logical, if TRUE creates tabix/CSI index for output. Default TRUE. |
con |
Optional existing DuckDB connection |
Invisible path to output file
    ## Not run:
    # Round-trip: VCF -> Parquet -> VCF
    vcf_file <- system.file("extdata", "1000G_3samples.vcf.gz", package = "RBCFTools")
    ext_path <- bcf_reader_build(tempdir(), verbose = FALSE)
    # Convert to Parquet (with metadata)
    parquet_file <- tempfile(fileext = ".parquet")
    vcf_to_parquet_duckdb(vcf_file, parquet_file, ext_path)
    # Convert back to VCF
    vcf_out <- tempfile(fileext = ".vcf.gz")
    parquet_to_vcf(parquet_file, vcf_out)
    ## End(Not run)
Prints example Makevars configuration that can be used by packages that want to link against htslib and/or bcftools via LinkingTo.
print_makevars_config(use_bcftools = FALSE, static = FALSE)
use_bcftools |
Logical. If |
static |
Logical. If |
Invisibly returns the Makevars text as a character string.
    # Print Makevars for htslib only
    print_makevars_config()
    # Print Makevars for both bcftools and htslib
    print_makevars_config(use_bcftools = TRUE)
Print method for vcf_duckdb objects
    ## S3 method for class 'vcf_duckdb'
    print(x, ...)
x |
A vcf_duckdb object |
... |
Additional arguments (ignored) |
Returns the path to the bundled ref-cache executable for reference sequence cache management.
ref_cache_path()
A character string containing the path to the ref-cache executable.
ref_cache_path()
Sets the HTS_PATH environment variable to point to the bundled htslib
plugins directory. This is required for S3, GCS, and other remote file
access via libcurl.
setup_hts_env()
Call this function before using bcftools/htslib tools with remote URLs
(s3://, gs://, http://, etc.). The function sets HTS_PATH to the package's
plugin directory so htslib can find hfile_libcurl.so and hfile_gcs.so.
Invisibly returns the previous value of HTS_PATH (or NA if unset).
    setup_hts_env()
    # Now bcftools can access S3 URLs
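The behaviour described for setup_hts_env(), setting HTS_PATH and handing back the previous value, follows a common save-and-set pattern that can be sketched in base R. The helper sketch_setup_hts_env() and the directory /opt/htslib/plugins are hypothetical; the real function uses the package's own plugin directory.

```r
# Hypothetical stand-in for setup_hts_env(); `plugins_dir` plays the role of
# the directory returned by htslib_plugins_dir().
sketch_setup_hts_env <- function(plugins_dir) {
  old <- Sys.getenv("HTS_PATH", unset = NA)  # NA when the variable is unset
  Sys.setenv(HTS_PATH = plugins_dir)
  invisible(old)
}

Sys.unsetenv("HTS_PATH")  # start from a known state for this demonstration
previous <- sketch_setup_hts_env("/opt/htslib/plugins")
Sys.getenv("HTS_PATH")

# Restore the earlier state once remote access is no longer needed.
if (is.na(previous)) {
  Sys.unsetenv("HTS_PATH")
} else {
  Sys.setenv(HTS_PATH = previous)
}
```

Returning the previous value invisibly is what makes the restore step at the end possible.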
Returns the path to the bundled tabix executable.
tabix_path()tabix_path()
A character string containing the path to the tabix executable.
tabix_path()
Reads the header of a VCF/BCF file and returns the corresponding Arrow schema.
vcf_arrow_schema(filename)
filename |
Path to VCF or BCF file |
A nanoarrow_schema object
Properly closes the DuckDB connection opened by vcf_open_duckdb.
vcf_close_duckdb(vcf, shutdown = TRUE)
vcf |
A vcf_duckdb object returned by vcf_open_duckdb |
shutdown |
Logical, whether to shutdown the DuckDB instance (default: TRUE) |
Invisible NULL
## Not run:
vcf <- vcf_open_duckdb("variants.vcf.gz", ext_path)
# ... do queries ...
vcf_close_duckdb(vcf)
## End(Not run)
Fast variant count using DuckDB projection pushdown.
vcf_count_duckdb(
  file,
  extension_path = NULL,
  region = NULL,
  tidy_format = FALSE,
  con = NULL
)
file |
Path to VCF, VCF.GZ, or BCF file |
extension_path |
Path to the bcf_reader.duckdb_extension file. |
region |
Optional genomic region for indexed files |
tidy_format |
Logical, if TRUE counts rows in tidy format (one per variant-sample). Default FALSE returns count of variants. |
con |
Optional existing DuckDB connection (with extension loaded). |
Integer count of variants (or variant-sample combinations if tidy_format=TRUE)
## Not run:
ext_path <- bcf_reader_build(tempdir())
vcf_count_duckdb("variants.vcf.gz", ext_path)
vcf_count_duckdb("variants.vcf.gz", ext_path, region = "chr22")
# Count variant-sample rows (variants * samples)
vcf_count_duckdb("cohort.vcf.gz", ext_path, tidy_format = TRUE)
## End(Not run)
Uses bcftools index --stats to get per-contig variant counts. Requires an indexed file.
vcf_count_per_contig(filename)
filename |
Path to VCF/BCF file (must be indexed) |
Named integer vector (names = contigs, values = variant counts)
## Not run:
counts <- vcf_count_per_contig("variants.vcf.gz")
# chr1: 12345, chr2: 23456, ...
## End(Not run)
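To illustrate the shape of the return value, the sketch below turns text in the three-column, tab-delimited layout printed by bcftools index --stats (contig name, contig length or "." if unknown, number of records) into a named integer vector. The helper parse_index_stats and the numbers are hypothetical; how the package itself parses this output is not shown here.

```r
# Made-up text in the layout of `bcftools index --stats`:
# contig <TAB> length ("." if unknown) <TAB> number of records
stats_text <- "chr1\t248956422\t12345\nchr2\t242193529\t23456\nchrM\t.\t7"

# Hypothetical helper: parse the text into a named integer vector.
parse_index_stats <- function(text) {
  tab <- utils::read.delim(
    text = text,
    header = FALSE,
    col.names = c("contig", "length", "n_records"),
    # length stays character because it can be "."
    colClasses = c("character", "character", "integer"),
    stringsAsFactors = FALSE
  )
  stats::setNames(tab$n_records, tab$contig)
}

counts <- parse_index_stats(stats_text)
```

With a vector of this shape, counts[["chr1"]] gives one contig and sum(counts) gives the file total.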
Uses bundled bcftools to count variants efficiently. For indexed files, this is very fast. Can also count per-chromosome.
vcf_count_variants(filename, region = NULL)
filename |
Path to VCF/BCF file |
region |
Optional region string (e.g., "chr1" or "chr1:1-1000") |
Integer count of variants
## Not run:
# Total variants
n <- vcf_count_variants("variants.vcf.gz")
# Variants on chr1
n_chr1 <- vcf_count_variants("variants.vcf.gz", region = "chr1")
## End(Not run)
Functions for querying VCF/BCF files using DuckDB with the bcf_reader extension.
Creates a DuckDB connection and loads the bcf_reader extension for VCF/BCF queries.
vcf_duckdb_connect(
  extension_path,
  dbdir = ":memory:",
  read_only = FALSE,
  config = list()
)
extension_path |
Path to the bcf_reader.duckdb_extension file. Must be explicitly provided. |
dbdir |
Database directory. Default ":memory:" for in-memory database. |
read_only |
Logical, whether to open in read-only mode. Default FALSE. |
config |
Named list of DuckDB configuration options. |
A DuckDB connection object with bcf_reader extension loaded
## Not run:
# First build the extension
ext_path <- bcf_reader_build(tempdir())
# Then connect
con <- vcf_duckdb_connect(ext_path)
DBI::dbGetQuery(con, "SELECT * FROM bcf_read('variants.vcf.gz') LIMIT 10")
DBI::dbDisconnect(con)
## End(Not run)
Extracts contig names and lengths from the VCF/BCF header.
vcf_get_contig_lengths(filename)
filename |
Path to VCF/BCF file |
Named integer vector (names = contigs, values = lengths)
## Not run:
lengths <- vcf_get_contig_lengths("variants.vcf.gz")
## End(Not run)
Extracts contig names from the VCF/BCF header using htslib.
vcf_get_contigs(filename)
filename |
Path to VCF/BCF file |
Character vector of contig names
## Not run:
contigs <- vcf_get_contigs("variants.vcf.gz")
## End(Not run)
Uses htslib to robustly check for index presence. Works with local files, remote URLs (S3, GCS, HTTP), and custom index paths.
vcf_has_index(filename, index = NULL)
filename |
Path to VCF/BCF file |
index |
Optional explicit index path |
Logical indicating if index exists
## Not run:
vcf_has_index("variants.vcf.gz")
vcf_has_index("s3://bucket/file.vcf.gz")
vcf_has_index("file.vcf.gz", index = "custom.tbi")
## End(Not run)
Extracts the full VCF header from a file for embedding in Parquet metadata. This allows round-tripping back to VCF format by preserving all header information (INFO, FORMAT, FILTER definitions, contigs, samples).
vcf_header_metadata(file)
file |
Path to VCF, VCF.GZ, or BCF file |
A named list with two elements:
vcf_header: The complete VCF header (all lines starting with #)
RBCFTools_version: Package version that created the Parquet file
## Not run:
vcf_file <- system.file("extdata", "1000G_3samples.vcf.gz", package = "RBCFTools")
meta <- vcf_header_metadata(vcf_file)
cat(meta$vcf_header)
## End(Not run)
Opens a VCF or BCF file and creates an Arrow array stream that produces record batches. This enables efficient, streaming access to variant data in Arrow format.
vcf_open_arrow(
  filename,
  batch_size = 10000L,
  region = NULL,
  samples = NULL,
  include_info = TRUE,
  include_format = TRUE,
  index = NULL,
  threads = 0L,
  parse_vep = FALSE,
  vep_tag = NULL,
  vep_columns = NULL,
  vep_transcript = c("first", "all")
)
filename |
Path to VCF or BCF file |
batch_size |
Number of records per batch (default: 10000) |
region |
Optional region string for filtering (e.g., "chr1:1000-2000") |
samples |
Optional sample filter (comma-separated names or "-" prefixed to exclude) |
include_info |
Include INFO fields in output (default: TRUE) |
include_format |
Include FORMAT/sample data in output (default: TRUE) |
index |
Optional index file path. If NULL (default), uses auto-detection: VCF files try .tbi first, then .csi; BCF files use .csi only. Useful for non-standard index locations or presigned URLs with different paths. Alternatively, use htslib ##idx## syntax in filename (e.g., "file.vcf.gz##idx##custom.tbi"). Note: Index is only required for region queries; whole-file streaming needs no index. |
threads |
Number of decompression threads (default: 0 = auto) |
parse_vep |
Enable VEP/BCSQ/ANN annotation parsing (default: FALSE). When TRUE, annotation fields are parsed and added as typed columns. |
vep_tag |
Annotation tag to parse ("CSQ", "BCSQ", "ANN") or NULL for auto-detect. |
vep_columns |
Character vector of VEP fields to extract, or NULL for all fields. |
vep_transcript |
Which transcript to extract: "first" (default) or "all". "first" returns scalar columns (one value per variant). "all" returns list columns (all transcripts per variant). |
A nanoarrow_array_stream object
## Not run:
# Basic usage
stream <- vcf_open_arrow("variants.vcf.gz")
# Read batches
while (!is.null(batch <- stream$get_next())) {
  # Process batch...
  print(nanoarrow::convert_array(batch))
}
# With region filter
stream <- vcf_open_arrow("variants.vcf.gz", region = "chr1:1-1000000")
# With custom index file (useful for presigned URLs or non-standard locations)
stream <- vcf_open_arrow("variants.vcf.gz", index = "custom_path.tbi", region = "chr1")
# Convert to data frame
df <- vcf_to_arrow("variants.vcf.gz", as = "data.frame")
# Write to parquet (uses DuckDB, no arrow package needed)
vcf_to_parquet_arrow("variants.vcf.gz", "variants.parquet")
## End(Not run)
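The index auto-detection order described for the index argument (VCF: .tbi first, then .csi; BCF: .csi only; an explicit path after ##idx## takes precedence) can be sketched as a small base R function. candidate_index_paths is a hypothetical helper, and appending the suffix to the full file name is an assumption based on the usual htslib naming convention; the actual lookup is done by htslib.

```r
# Hypothetical helper: list the index paths that would be tried, in order.
candidate_index_paths <- function(filename) {
  # An explicit index given with the "##idx##" syntax replaces auto-detection.
  parts <- strsplit(filename, "##idx##", fixed = TRUE)[[1]]
  if (length(parts) == 2L) {
    return(parts[2])
  }
  if (grepl("\\.bcf$", filename)) {
    # BCF files use .csi only.
    paste0(filename, ".csi")
  } else {
    # VCF files try .tbi first, then .csi.
    paste0(filename, c(".tbi", ".csi"))
  }
}

vcf_candidates <- candidate_index_paths("variants.vcf.gz")
bcf_candidates <- candidate_index_paths("variants.bcf")
explicit_index <- candidate_index_paths("file.vcf.gz##idx##custom.tbi")
```

As the argument description notes, none of this matters for whole-file streaming; an index is only needed when region is set.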
Creates a DuckDB connection with the VCF data loaded as a table or view. Supports in-memory or file-backed databases, tidy format output, parallel loading by chromosome, column selection, and optional Hive partitioning.
vcf_open_duckdb(
  file,
  extension_path,
  table_name = "variants",
  as_view = TRUE,
  dbdir = ":memory:",
  columns = NULL,
  region = NULL,
  tidy_format = FALSE,
  threads = 1L,
  partition_by = NULL,
  overwrite = FALSE,
  config = list()
)
file |
Path to VCF, VCF.GZ, or BCF file |
extension_path |
Path to the bcf_reader.duckdb_extension file. |
table_name |
Name for the table/view (default: "variants") |
as_view |
Logical, create a VIEW instead of materializing a TABLE (default: TRUE). Views are instant to create but queries re-read the VCF each time. Tables are slower to create but subsequent queries are fast. |
dbdir |
Database directory. Default ":memory:" for in-memory database. Use a file path for persistent storage (e.g., "variants.duckdb"). |
columns |
Optional character vector of columns to include. NULL for all. |
region |
Optional genomic region filter (e.g., "chr1:1000-2000"). Requires an indexed VCF. |
tidy_format |
Logical, if TRUE loads data in tidy (long) format with one row per variant-sample combination and a SAMPLE_ID column. Default FALSE. |
threads |
Number of threads for parallel loading (default: 1). When > 1 and VCF is indexed:
|
partition_by |
Optional character vector of columns to partition by when creating a table (ignored for views). Creates a partitioned table for efficient filtering. Only supported for file-backed databases. |
overwrite |
Logical, drop existing table/view if it exists (default: FALSE). |
config |
Named list of DuckDB configuration options. |
A list with:
con |
DuckDB connection with extension loaded |
table |
Name of the created table/view |
is_view |
Logical indicating if a view was created |
file |
Path to the source VCF file |
dbdir |
Database directory |
tidy_format |
Whether tidy format was used |
row_count |
Number of rows (NULL for views) |
## Not run:
ext_path <- bcf_reader_build(tempdir())
# Open as lazy view (default - instant creation, re-reads VCF each query)
vcf <- vcf_open_duckdb("variants.vcf.gz", ext_path)
DBI::dbGetQuery(vcf$con, "SELECT * FROM variants WHERE CHROM = '22'")
vcf_close_duckdb(vcf)
# Parallel view (UNION ALL of per-contig reads, parallelized at query time)
vcf <- vcf_open_duckdb("wgs.vcf.gz", ext_path, threads = 8)
# Open as materialized table (slower to create, fast repeated queries)
vcf <- vcf_open_duckdb("variants.vcf.gz", ext_path, as_view = FALSE)
DBI::dbGetQuery(vcf$con, "SELECT COUNT(*) FROM variants")
# Tidy format with specific columns
vcf <- vcf_open_duckdb("cohort.vcf.gz", ext_path,
  tidy_format = TRUE,
  columns = c("CHROM", "POS", "REF", "ALT", "SAMPLE_ID", "FORMAT_GT")
)
# Parallel table loading for large files
vcf <- vcf_open_duckdb("wgs.vcf.gz", ext_path, as_view = FALSE, threads = 8)
# Persistent file-backed database
vcf <- vcf_open_duckdb("variants.vcf.gz", ext_path,
  dbdir = "my_variants.duckdb"
)
# Partitioned table for efficient sample queries
vcf <- vcf_open_duckdb("cohort.vcf.gz", ext_path,
  dbdir = "cohort.duckdb",
  tidy_format = TRUE,
  partition_by = "SAMPLE_ID"
)
## End(Not run)
Enables SQL queries on VCF files using DuckDB. This allows powerful filtering, aggregation, and joining operations.
vcf_query_arrow(vcf_files, query, ...)
vcf_files |
Character vector of VCF file paths |
query |
SQL query string. Use "vcf" as the table name. |
... |
Additional arguments passed to vcf_open_arrow |
Query result as a data frame
## Not run:
# Count variants per chromosome
vcf_query_arrow(
  "variants.vcf.gz",
  "SELECT CHROM, COUNT(*) as n FROM vcf GROUP BY CHROM"
)
# Filter high-quality variants
vcf_query_arrow(
  "variants.vcf.gz",
  "SELECT * FROM vcf WHERE QUAL > 30"
)
# Join multiple VCF files
vcf_query_arrow(
  c("sample1.vcf.gz", "sample2.vcf.gz"),
  "SELECT * FROM vcf WHERE POS BETWEEN 1000 AND 2000"
)
## End(Not run)
Execute a SQL query against a VCF/BCF file using the bcf_reader extension.
The file is exposed as a table via the bcf_read() function.
vcf_query_duckdb(
  file,
  extension_path = NULL,
  query = NULL,
  region = NULL,
  tidy_format = FALSE,
  con = NULL
)
file |
Path to VCF, VCF.GZ, or BCF file |
extension_path |
Path to the bcf_reader.duckdb_extension file. |
query |
SQL query string. Use |
region |
Optional genomic region for indexed files (e.g., "chr1:1000-2000") |
tidy_format |
Logical, if TRUE returns data in tidy (long) format with one row per variant-sample combination and a SAMPLE_ID column. Default FALSE. |
con |
Optional existing DuckDB connection (with extension already loaded). If provided, extension_path is ignored. |
A data.frame with query results
## Not run:
# First build the extension
ext_path <- bcf_reader_build(tempdir())
# Basic query - get all variants
vcf_query_duckdb("variants.vcf.gz", ext_path)
# Count variants
vcf_query_duckdb("variants.vcf.gz", ext_path,
  query = "SELECT COUNT(*) FROM bcf_read('{file}')"
)
# Filter by chromosome
vcf_query_duckdb("variants.vcf.gz", ext_path,
  query = "SELECT CHROM, POS, REF, ALT FROM bcf_read('{file}') WHERE CHROM = '22'"
)
# Region query (requires index)
vcf_query_duckdb("variants.vcf.gz", ext_path, region = "chr1:1000000-2000000")
# Tidy format - one row per variant-sample
vcf_query_duckdb("cohort.vcf.gz", ext_path, tidy_format = TRUE)
# Reuse connection for multiple queries
con <- vcf_duckdb_connect(ext_path)
vcf_query_duckdb("file1.vcf.gz", con = con)
vcf_query_duckdb("file2.vcf.gz", con = con)
DBI::dbDisconnect(con, shutdown = TRUE)
## End(Not run)
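The queries in the examples use a {file} placeholder inside bcf_read(). One plausible way such a placeholder could be expanded is plain string substitution, sketched below. fill_file_placeholder is a hypothetical helper; whether the package substitutes the path this way, and how it escapes quotes, is an assumption.

```r
# Hypothetical helper: replace the literal "{file}" token in a query template.
fill_file_placeholder <- function(query, file) {
  # Double any single quotes so the path stays a valid SQL string literal.
  safe <- gsub("'", "''", file, fixed = TRUE)
  # fixed = TRUE treats "{file}" as literal text, not as a regular expression.
  gsub("{file}", safe, query, fixed = TRUE)
}

sql <- fill_file_placeholder(
  "SELECT COUNT(*) FROM bcf_read('{file}')",
  "variants.vcf.gz"
)
```

Writing queries against the placeholder rather than a hard-coded path keeps the same SQL reusable across files.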
Opens a VCF file and parses VEP/BCSQ/ANN annotations into structured
columns. This is a convenience wrapper around vcf_open_arrow
with VEP parsing enabled.
vcf_read_vep(filename, vep_tag = NULL, vep_columns = NULL, ...)
filename |
Path to VCF/BCF file |
vep_tag |
Annotation tag to parse ("CSQ", "BCSQ", "ANN") or NULL for auto-detection |
vep_columns |
Character vector of VEP fields to extract, or NULL for all fields |
... |
Additional arguments passed to |
Data frame with VCF columns plus parsed VEP fields as separate columns prefixed with the tag name (e.g., "CSQ_Consequence", "CSQ_SYMBOL", etc.)
## Not run:
df <- vcf_read_vep("annotated.vcf.gz",
  vep_columns = c("Consequence", "SYMBOL", "AF", "gnomAD_AF")
)
# Filter by gnomAD frequency
rare <- df[!is.na(df$CSQ_gnomAD_AF) & df$CSQ_gnomAD_AF < 0.001, ]
## End(Not run)
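For readers unfamiliar with the annotation layout, the sketch below splits a single CSQ-style string by hand: transcripts are comma-separated and fields within a transcript are pipe-separated. parse_csq, the three-field layout, and the example values are all hypothetical; the package does this parsing in compiled code and returns typed columns, which this base R sketch does not attempt.

```r
# Hypothetical helper: split one CSQ-style value into named fields.
parse_csq <- function(csq, fields, transcript = c("first", "all")) {
  transcript <- match.arg(transcript)
  entries <- strsplit(csq, ",", fixed = TRUE)[[1]]   # one entry per transcript
  values <- lapply(entries, function(entry) {
    v <- strsplit(entry, "|", fixed = TRUE)[[1]]
    length(v) <- length(fields)                      # pad dropped trailing fields with NA
    v[!is.na(v) & v == ""] <- NA                     # empty field means missing
    stats::setNames(v, fields)
  })
  # "first": one value per field; "all": one row per transcript.
  if (transcript == "first") values[[1]] else do.call(rbind, values)
}

fields <- c("Consequence", "SYMBOL", "gnomAD_AF")    # assumed field order
csq <- "missense_variant|BRCA2|0.0004,intron_variant|BRCA2|"

first_tx <- parse_csq(csq, fields, "first")
all_tx <- parse_csq(csq, fields, "all")
```

This mirrors the two vep_transcript modes of vcf_open_arrow: "first" keeps one value per variant, while "all" keeps every transcript.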
Extract sample names from FORMAT column names.
vcf_samples_duckdb(file, extension_path = NULL, con = NULL)
file |
Path to VCF, VCF.GZ, or BCF file |
extension_path |
Path to the bcf_reader.duckdb_extension file. |
con |
Optional existing DuckDB connection (with extension loaded). |
Character vector of sample names
## Not run:
ext_path <- bcf_reader_build(tempdir())
vcf_samples_duckdb("variants.vcf.gz", ext_path)
## End(Not run)
Returns the column names and types for a VCF/BCF file as seen by DuckDB.
vcf_schema_duckdb(file, extension_path = NULL, tidy_format = FALSE, con = NULL)
file |
Path to VCF, VCF.GZ, or BCF file |
extension_path |
Path to the bcf_reader.duckdb_extension file. |
tidy_format |
Logical, if TRUE returns schema for tidy format. Default FALSE. |
con |
Optional existing DuckDB connection (with extension loaded). |
A data.frame with column_name and column_type
## Not run:
ext_path <- bcf_reader_build(tempdir())
vcf_schema_duckdb("variants.vcf.gz", ext_path)
# Compare wide vs tidy schemas
vcf_schema_duckdb("cohort.vcf.gz", ext_path) # FORMAT_GT_Sample1, FORMAT_GT_Sample2...
vcf_schema_duckdb("cohort.vcf.gz", ext_path, tidy_format = TRUE) # SAMPLE_ID, FORMAT_GT
## End(Not run)
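The difference between the wide and tidy schemas can be illustrated on a toy data frame in base R. The column names follow the FORMAT_GT_<sample> pattern shown in the example comments; the data and the reshaping code are a made-up illustration, not the extension's implementation.

```r
# Toy wide table: one row per variant, one FORMAT_GT column per sample.
wide <- data.frame(
  CHROM = c("22", "22"),
  POS = c(100L, 200L),
  FORMAT_GT_Sample1 = c("0/1", "1/1"),
  FORMAT_GT_Sample2 = c("0/0", "0/1"),
  stringsAsFactors = FALSE
)

# Recover sample names from the FORMAT column names.
gt_cols <- grep("^FORMAT_GT_", names(wide), value = TRUE)
samples <- sub("^FORMAT_GT_", "", gt_cols)

# Tidy (long) table: one row per variant-sample pair with a SAMPLE_ID column.
tidy <- do.call(rbind, lapply(samples, function(s) {
  data.frame(
    CHROM = wide$CHROM,
    POS = wide$POS,
    SAMPLE_ID = s,
    FORMAT_GT = wide[[paste0("FORMAT_GT_", s)]],
    stringsAsFactors = FALSE
  )
}))
```

The tidy table has nrow(wide) * length(samples) rows, which is the count that tidy_format = TRUE reports in vcf_count_duckdb.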
Get summary statistics including variant counts per chromosome.
vcf_summary_duckdb(file, extension_path = NULL, con = NULL)
file |
Path to VCF, VCF.GZ, or BCF file |
extension_path |
Path to the bcf_reader.duckdb_extension file. |
con |
Optional existing DuckDB connection (with extension loaded). |
A list with total_variants, n_samples, and variants_per_chrom
## Not run:
ext_path <- bcf_reader_build(tempdir())
vcf_summary_duckdb("variants.vcf.gz", ext_path)
## End(Not run)
Convenience function to read an entire VCF file into memory as an R data structure.
vcf_to_arrow(filename, as = c("tibble", "data.frame", "batches"), ...)
filename |
Path to VCF or BCF file |
as |
Character string specifying output format: "tibble", "data.frame", or "batches" (list of nanoarrow arrays) |
... |
Additional arguments passed to vcf_open_arrow |
Depends on as parameter:
"tibble": A tibble
"data.frame": A data.frame
"batches": A list of nanoarrow_array objects
Converts a VCF/BCF file to Arrow IPC stream format for efficient storage and interoperability with Arrow-compatible tools. Uses nanoarrow's native IPC writer for streaming output.
vcf_to_arrow_ipc(input_vcf, output_ipc, ...)
input_vcf |
Path to input VCF or BCF file |
output_ipc |
Path for output Arrow IPC file (typically .arrows extension) |
... |
Additional arguments passed to vcf_open_arrow |
Invisibly returns the output path
## Not run:
vcf_to_arrow_ipc("variants.vcf.gz", "variants.arrows")
# Read back with nanoarrow
stream <- nanoarrow::read_nanoarrow("variants.arrows")
df <- as.data.frame(stream)
# Or query with DuckDB
library(duckdb)
con <- dbConnect(duckdb())
dbGetQuery(con, "SELECT * FROM 'variants.arrows' LIMIT 10")
## End(Not run)
Converts a VCF/BCF file to Apache Parquet format for efficient storage and querying with tools like DuckDB, Spark, or Python pandas/polars.
vcf_to_parquet_arrow(
  input_vcf,
  output_parquet,
  compression = "zstd",
  row_group_size = 100000L,
  streaming = FALSE,
  threads = 1L,
  index = NULL,
  ...
)
input_vcf |
Path to input VCF or BCF file |
output_parquet |
Path for output Parquet file |
compression |
Compression codec: "snappy", "gzip", "zstd", "lz4", "uncompressed" |
row_group_size |
Number of rows per row group (default: 100000) |
streaming |
Use streaming mode for large files. When TRUE, writes to a temporary Arrow IPC file first (via nanoarrow), then converts to Parquet via DuckDB. This avoids loading the entire VCF into R memory. Requires the DuckDB nanoarrow community extension. Default is FALSE. |
threads |
Number of parallel threads for processing (default: 1).
When threads > 1 and file is indexed, uses parallel processing by splitting
work across chromosomes/contigs. Each thread processes different regions
simultaneously. Requires indexed file. See |
index |
Optional explicit index file path |
... |
Additional arguments passed to vcf_open_arrow |
Processing Modes:
Standard mode (streaming = FALSE, threads = 1): Loads entire VCF
into memory as data.frame before writing. Fast for small-medium files.
Streaming mode (streaming = TRUE, threads = 1): Two-stage streaming
via temporary Arrow IPC file. Minimal memory usage for large files.
Parallel mode (threads > 1): Requires indexed file. Splits work by
chromosomes, processing multiple regions simultaneously. Near-linear
speedup with thread count. Best for whole-genome VCFs.
Invisibly returns the output path
## Not run:
# Standard mode (fast, loads into memory)
vcf_to_parquet_arrow("variants.vcf.gz", "variants.parquet")
# Streaming mode for large files (low memory)
vcf_to_parquet_arrow("huge.vcf.gz", "huge.parquet", streaming = TRUE)
# Parallel mode for whole-genome VCF (requires index)
vcf_to_parquet_arrow("wgs.vcf.gz", "wgs.parquet", threads = 8)
# Parallel + streaming for massive files
vcf_to_parquet_arrow("wgs.vcf.gz", "wgs.parquet", threads = 16, streaming = TRUE)
# With zstd compression
vcf_to_parquet_arrow("variants.vcf.gz", "variants.parquet", compression = "zstd")
# Query with DuckDB
library(duckdb)
con <- dbConnect(duckdb())
dbGetQuery(con, "SELECT CHROM, POS, REF FROM 'variants.parquet' WHERE CHROM = 'chr1'")
## End(Not run)
Convert a VCF/BCF file to Parquet format for fast subsequent queries.
vcf_to_parquet_duckdb(
  input_file,
  output_file,
  extension_path = NULL,
  columns = NULL,
  region = NULL,
  compression = "zstd",
  row_group_size = 100000L,
  threads = 1L,
  tidy_format = FALSE,
  partition_by = NULL,
  include_metadata = TRUE,
  con = NULL
)
input_file |
Path to input VCF, VCF.GZ, or BCF file |
output_file |
Path to output Parquet file or directory (when using partition_by) |
extension_path |
Path to the bcf_reader.duckdb_extension file. |
columns |
Optional character vector of columns to include. NULL for all. |
region |
Optional genomic region to export (requires index) |
compression |
Parquet compression: "snappy", "zstd", "gzip", or "none" |
row_group_size |
Number of rows per row group (default: 100000) |
threads |
Number of parallel threads for processing (default: 1).
When threads > 1 and file is indexed, uses parallel processing by splitting
work across chromosomes/contigs. See |
tidy_format |
Logical, if TRUE exports data in tidy (long) format with one row per variant-sample combination and a SAMPLE_ID column. Default FALSE. |
partition_by |
Optional character vector of columns to partition by (Hive-style).
Creates a directory structure like |
include_metadata |
Logical, if TRUE embeds the full VCF header as Parquet
key-value metadata. Default TRUE. This preserves all VCF schema information
(INFO, FORMAT, FILTER definitions, contigs, samples) enabling round-trip back
to VCF format. Use |
con |
Optional existing DuckDB connection (with extension loaded). |
Invisible path to output file/directory
## Not run:
ext_path <- bcf_reader_build(tempdir())
# Export entire file with metadata
vcf_to_parquet_duckdb("variants.vcf.gz", "variants.parquet", ext_path)
# Read back the embedded metadata
parquet_kv_metadata("variants.parquet")
# Export specific columns
vcf_to_parquet_duckdb("variants.vcf.gz", "variants_slim.parquet", ext_path,
  columns = c("CHROM", "POS", "REF", "ALT", "INFO_AF")
)
# Export a region
vcf_to_parquet_duckdb("variants.vcf.gz", "chr22.parquet", ext_path,
  region = "chr22"
)
# Export in tidy format (one row per variant-sample)
vcf_to_parquet_duckdb("cohort.vcf.gz", "cohort_tidy.parquet", ext_path,
  tidy_format = TRUE
)
# Tidy format with Hive partitioning by SAMPLE_ID (efficient per-sample queries)
vcf_to_parquet_duckdb("cohort.vcf.gz", "cohort_partitioned/", ext_path,
  tidy_format = TRUE,
  partition_by = "SAMPLE_ID"
)
# Partition by both CHROM and SAMPLE_ID for large cohorts
vcf_to_parquet_duckdb("wgs_cohort.vcf.gz", "wgs_partitioned/", ext_path,
  tidy_format = TRUE,
  partition_by = c("CHROM", "SAMPLE_ID")
)
# Parallel mode for whole-genome VCF (requires index)
vcf_to_parquet_duckdb("wgs.vcf.gz", "wgs.parquet", ext_path, threads = 8)
## End(Not run)
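As background on Hive-style partitioning in general, the base R sketch below splits a toy tidy table by one column and writes each group under a key=value directory. It writes CSV instead of Parquet so that it needs no extra packages; write_hive_partitions, the file name data_0.csv, and the toy data are hypothetical, and the exact layout the package produces is not shown here.

```r
# Hypothetical helper: write one file per partition value under "<by>=<value>/".
write_hive_partitions <- function(df, out_dir, by) {
  groups <- split(df, df[[by]])
  paths <- vapply(names(groups), function(key) {
    part_dir <- file.path(out_dir, paste0(by, "=", key))
    dir.create(part_dir, recursive = TRUE, showWarnings = FALSE)
    path <- file.path(part_dir, "data_0.csv")
    # The partition value lives in the directory name, so drop the column.
    keep <- setdiff(names(df), by)
    utils::write.csv(groups[[key]][keep], path, row.names = FALSE)
    path
  }, character(1))
  unname(paths)
}

tidy <- data.frame(
  CHROM = c("22", "22", "22"),
  POS = c(100L, 100L, 200L),
  SAMPLE_ID = c("S1", "S2", "S1"),
  FORMAT_GT = c("0/1", "0/0", "1/1"),
  stringsAsFactors = FALSE
)

out_dir <- file.path(tempdir(), "hive_demo")
paths <- write_hive_partitions(tidy, out_dir, by = "SAMPLE_ID")
partition_dirs <- basename(dirname(paths))
```

A reader that understands this convention can answer a filter such as SAMPLE_ID = 'S1' by opening only the matching directory, which is why partitioning by sample suits per-sample queries.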
Processes VCF/BCF file in parallel by splitting work across chromosomes/contigs using the DuckDB bcf_reader extension. Requires an indexed file. Each thread processes a different chromosome, then results are merged into a single Parquet file.
vcf_to_parquet_duckdb_parallel(
  input_file,
  output_file,
  extension_path = NULL,
  threads = parallel::detectCores(),
  compression = "zstd",
  row_group_size = 100000L,
  columns = NULL,
  tidy_format = FALSE,
  partition_by = NULL,
  con = NULL
)
input_file |
Path to input VCF/BCF file (must be indexed) |
output_file |
Path for output Parquet file |
extension_path |
Path to the bcf_reader.duckdb_extension file. |
threads |
Number of parallel threads (default: auto-detect) |
compression |
Parquet compression codec |
row_group_size |
Row group size |
columns |
Optional character vector of columns to include |
tidy_format |
Logical, if TRUE exports data in tidy (long) format. Default FALSE. |
partition_by |
Optional character vector of columns to partition by (Hive-style).
Creates directory structure like |
con |
Optional existing DuckDB connection (with extension loaded). |
This function:
1. Checks for an index (required for parallel processing)
2. Extracts contig names from the header
3. Processes each contig in parallel using multiple R processes
4. Writes each contig to a temporary Parquet file
5. Merges all temporary files into the final output using DuckDB
Contigs that return no variants are skipped automatically.
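The steps above follow a split, process, merge pattern. The sketch below imitates that pattern on a toy data frame in base R and assumes nothing about the package internals: `process_contig` is a hypothetical helper, RDS files stand in for the temporary Parquet files, and `parallel::mclapply` stands in for whatever process pool the function actually uses.

```r
# Toy stand-in for the per-contig workflow; no VCF or Parquet I/O happens here.
variants <- data.frame(
  CHROM = c("chr1", "chr1", "chr2", "chr3"),
  POS   = c(100L, 200L, 150L, 300L),
  stringsAsFactors = FALSE
)
contigs <- c("chr1", "chr2", "chr3", "chrM")  # chrM has no variants

# Hypothetical worker: write one contig to a temporary file.
# Returns NULL for an empty contig so the caller can skip it.
process_contig <- function(contig, data, tmp_dir) {
  chunk <- data[data$CHROM == contig, , drop = FALSE]
  if (nrow(chunk) == 0L) return(NULL)
  path <- file.path(tmp_dir, paste0(contig, ".rds"))
  saveRDS(chunk, path)
  path
}

tmp_dir <- tempfile("contigs_")
dir.create(tmp_dir)

# One task per contig; forking is unavailable on Windows, so fall back to 1 core.
cores <- if (.Platform$OS.type == "windows") 1L else 2L
paths <- parallel::mclapply(contigs, process_contig,
  data = variants, tmp_dir = tmp_dir, mc.cores = cores
)

# Drop empty contigs, then merge the temporary files in contig order.
paths <- Filter(Negate(is.null), paths)
merged <- do.call(rbind, lapply(paths, readRDS))
rownames(merged) <- NULL
unlink(tmp_dir, recursive = TRUE)

print(merged)
```

Because each task writes its own file, the workers never share state, and the merge step only has to concatenate files in a fixed contig order.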
When partition_by is specified, the function creates a Hive-partitioned directory
structure. This is especially useful with tidy_format = TRUE and
partition_by = "SAMPLE_ID" for efficient per-sample queries on large cohorts.
DuckDB auto-generates Bloom filters for VARCHAR columns like SAMPLE_ID.
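As an illustration of what "Hive-style" means here, the sketch below builds the `key=value` directory names that this convention produces for a set of partition columns. It is a minimal base R sketch: `hive_dirs` is a hypothetical helper, and the sample IDs and output roots are made up for the example. It computes paths only and writes nothing.

```r
# Hypothetical tidy-format rows; only the partition columns matter here.
tidy <- data.frame(
  SAMPLE_ID = c("S1", "S2", "S1"),
  CHROM     = c("chr1", "chr1", "chr2"),
  stringsAsFactors = FALSE
)

# Build one "column=value" path component per partition column,
# nested in the order the columns are given.
hive_dirs <- function(data, partition_by, root) {
  parts <- lapply(partition_by, function(col) paste0(col, "=", data[[col]]))
  unique(do.call(file.path, c(list(root), parts)))
}

by_sample <- hive_dirs(tidy, "SAMPLE_ID", "cohort_partitioned")
by_both   <- hive_dirs(tidy, c("CHROM", "SAMPLE_ID"), "wgs_partitioned")

print(by_sample)
print(by_both)
```

A reader that understands Hive partitioning can answer a per-sample query by opening only the matching `SAMPLE_ID=...` directory, which is why this layout suits per-sample access on large cohorts.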
Invisibly returns the output path
See vcf_to_parquet_duckdb for single-threaded conversion.
## Not run:
ext_path <- bcf_reader_build(tempdir())

# Use 8 threads
vcf_to_parquet_duckdb_parallel("wgs.vcf.gz", "wgs.parquet", ext_path, threads = 8)

# With specific columns
vcf_to_parquet_duckdb_parallel(
  "wgs.vcf.gz", "wgs.parquet", ext_path,
  threads = 16,
  columns = c("CHROM", "POS", "REF", "ALT")
)

# Tidy format output
vcf_to_parquet_duckdb_parallel("wgs.vcf.gz", "wgs_tidy.parquet", ext_path,
  threads = 8,
  tidy_format = TRUE
)

# Tidy format with Hive partitioning by SAMPLE_ID
vcf_to_parquet_duckdb_parallel("wgs_cohort.vcf.gz", "wgs_partitioned/", ext_path,
  threads = 8,
  tidy_format = TRUE,
  partition_by = "SAMPLE_ID"
)
## End(Not run)
Processes a VCF/BCF file in parallel by splitting work across chromosomes/contigs. Requires an indexed file. Each thread processes a different chromosome, and the results are then merged into a single Parquet file.
vcf_to_parquet_parallel_arrow(
  input_vcf,
  output_parquet,
  threads = parallel::detectCores(),
  compression = "zstd",
  row_group_size = 100000L,
  streaming = FALSE,
  index = NULL,
  ...
)
input_vcf |
Path to input VCF/BCF file (must be indexed) |
output_parquet |
Path for output Parquet file |
threads |
Number of parallel threads (default: auto-detect) |
compression |
Compression codec |
row_group_size |
Row group size |
streaming |
Use streaming mode |
index |
Optional explicit index path |
... |
Additional arguments passed to vcf_open_arrow |
This function:
1. Checks for an index (required for parallel processing)
2. Extracts contig names from the header
3. Processes each contig in parallel using multiple R processes
4. Writes each contig to a temporary Parquet file
5. Merges all temporary files into the final output using DuckDB
Contigs that return no variants are skipped automatically.
Invisibly returns the output path
## Not run:
# Use 8 threads
vcf_to_parquet_parallel_arrow("wgs.vcf.gz", "wgs.parquet", threads = 8)

# With streaming mode for large files
vcf_to_parquet_parallel_arrow(
  "huge.vcf.gz", "huge.parquet",
  threads = 16,
  streaming = TRUE
)
## End(Not run)
Checks for the presence of CSQ, BCSQ, or ANN annotation tags in the VCF header and returns the first one found.
vep_detect_tag(filename)
filename |
Path to VCF/BCF file |
Character string with tag name ("CSQ", "BCSQ", or "ANN"), or NA if no annotation found
## Not run:
vep_detect_tag("annotated.vcf.gz")
# Returns "CSQ"
## End(Not run)
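To make the header check concrete, here is a minimal base R sketch that scans header lines for an `##INFO=<ID=...>` declaration of each tag. `detect_annotation_tag` is a hypothetical stand-in, not the package's implementation, and the priority order CSQ, BCSQ, ANN is an assumption taken from the order in which the tags are listed above.

```r
# Hypothetical helper: return the first annotation tag declared in the header.
# Assumed priority order: CSQ, then BCSQ, then ANN.
detect_annotation_tag <- function(header_lines, tags = c("CSQ", "BCSQ", "ANN")) {
  for (tag in tags) {
    pattern <- paste0("^##INFO=<ID=", tag, ",")
    if (any(grepl(pattern, header_lines))) return(tag)
  }
  NA_character_
}

# Made-up header lines for illustration.
header <- c(
  "##fileformat=VCFv4.2",
  "##INFO=<ID=DP,Number=1,Type=Integer,Description=\"Total depth\">",
  "##INFO=<ID=CSQ,Number=.,Type=String,Description=\"Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT\">"
)

found   <- detect_annotation_tag(header)       # CSQ is declared
missing <- detect_annotation_tag(header[1:2])  # no annotation tag declared
print(found)
print(missing)
```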
Parses the VEP/BCSQ/ANN header to extract field names and inferred types. Types are inferred using bcftools split-vep conventions.
vep_get_schema(filename, tag = NULL)
filename |
Path to VCF/BCF file |
tag |
Optional annotation tag ("CSQ", "BCSQ", "ANN"). If NULL (default), auto-detects. |
Data frame with columns:
name: Field name (e.g., "Consequence", "SYMBOL", "AF")
type: Inferred type ("Integer", "Float", "String")
index: Position in the pipe-delimited string (0-based)
is_list: Whether the field can have multiple values
The tag name is stored as an attribute.
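For a VEP `CSQ` tag, the field names come from the `Format:` part of the INFO header's Description. The base R sketch below shows that extraction step only, under the assumption of a VEP-style header line. `parse_vep_format` is a hypothetical helper, the header line is made up, and type inference and list detection are left out.

```r
# Hypothetical helper: pull field names and 0-based positions out of a
# VEP-style INFO header line.
parse_vep_format <- function(info_line) {
  # Keep the text between "Format: " and the closing quote of Description.
  fmt <- sub('.*Format: ([^"]*)".*', "\\1", info_line)
  fields <- strsplit(fmt, "|", fixed = TRUE)[[1]]
  data.frame(
    name  = fields,
    index = seq_along(fields) - 1L,
    stringsAsFactors = FALSE
  )
}

# Made-up header line for illustration.
csq_line <- "##INFO=<ID=CSQ,Number=.,Type=String,Description=\"Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL\">"

schema_sketch <- parse_vep_format(csq_line)
print(schema_sketch)
```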
## Not run:
schema <- vep_get_schema("vep_annotated.vcf.gz")
print(schema)
#          name   type index is_list
# 1      Allele String     0   FALSE
# 2 Consequence String     1    TRUE
# 3      IMPACT String     2   FALSE
# ...
attr(schema, "tag")
# "CSQ"
## End(Not run)
Checks whether a VCF/BCF file has VEP-style annotations.
vep_has_annotation(filename)
filename |
Path to VCF/BCF file |
Logical indicating presence of CSQ, BCSQ, or ANN
## Not run:
if (vep_has_annotation("file.vcf.gz")) {
  schema <- vep_get_schema("file.vcf.gz")
}
## End(Not run)
Uses bcftools split-vep conventions to infer the type of a VEP field from its name.
vep_infer_type(field_name)
field_name |
Character vector of field names |
Known integer fields: DISTANCE, STRAND, TSL, GENE_PHENO, HGVS_OFFSET, MOTIF_POS, existing_ORFs, SpliceAI_pred_DP_*
Known float fields: AF, *_AF (e.g., gnomAD_AF), MAX_AF, MOTIF_SCORE_CHANGE, SpliceAI_pred_DS_*
All others default to String.
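The name-based rules above can be sketched with two regular expressions. This is a minimal base R illustration, not the package's code: `infer_type_sketch` is a hypothetical name, and the patterns cover only the fields whose spelling is unambiguous in the lists above.

```r
# Hypothetical helper: map field names to "Integer", "Float", or "String"
# using name patterns only. Anything unmatched stays "String".
infer_type_sketch <- function(field_name) {
  integer_re <- "^(DISTANCE|STRAND|TSL|GENE_PHENO|HGVS_OFFSET|MOTIF_POS|SpliceAI_pred_DP_.*)$"
  float_re   <- "^(AF|.*_AF|MOTIF_SCORE_CHANGE|SpliceAI_pred_DS_.*)$"

  type <- rep("String", length(field_name))
  type[grepl(integer_re, field_name)] <- "Integer"
  type[grepl(float_re, field_name)]   <- "Float"
  type
}

types <- infer_type_sketch(
  c("SYMBOL", "AF", "gnomAD_AF", "DISTANCE", "SpliceAI_pred_DS_AG")
)
print(types)
```

For these five names the sketch yields the same types as the documented `vep_infer_type` example: String, Float, Float, Integer, Float.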
Character vector of inferred types ("Integer", "Float", "String")
vep_infer_type(c("SYMBOL", "AF", "gnomAD_AF", "DISTANCE", "SpliceAI_pred_DS_AG"))
# [1] "String" "Float" "Float" "Integer" "Float"
Convenience function to display available VEP fields and their types.
vep_list_fields(filename)
filename |
Path to VCF/BCF file |
Invisibly returns the schema data frame
## Not run:
vep_list_fields("annotated.vcf.gz")
# VEP Annotation Tag: CSQ
# Fields (78 total):
# 1. Allele (String)
# 2. Consequence (String, list)
# 3. IMPACT (String)
# ...
## End(Not run)
Parses a CSQ/BCSQ/ANN annotation string into a structured list of data frames, one per transcript/consequence.
vep_parse_record(csq_value, filename, schema = NULL)
csq_value |
Raw annotation string (pipe-delimited, comma-separated for multiple transcripts) |
filename |
Path to VCF file (for schema extraction) |
schema |
Optional pre-parsed schema from |
List of data frames, one per transcript. Each data frame has one row with columns corresponding to annotation fields, properly typed.
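The structure of the return value can be illustrated with a small base R sketch: split the string on commas to get one entry per transcript, split each entry on pipes, then convert each field according to its type. `parse_annotation` is a hypothetical helper and the field names, types, and annotation string are made up. Splitting multi-valued fields (such as several consequences joined by `&`) is not attempted.

```r
# Hypothetical helper: turn one raw annotation string into a list of
# one-row data frames, given field names and their types.
parse_annotation <- function(value, fields, types) {
  entries <- strsplit(value, ",", fixed = TRUE)[[1]]
  lapply(entries, function(entry) {
    parts <- strsplit(entry, "|", fixed = TRUE)[[1]]
    length(parts) <- length(fields)           # pad missing trailing fields with NA
    parts[!is.na(parts) & parts == ""] <- NA  # empty field means missing
    cols <- lapply(seq_along(fields), function(i) {
      switch(types[i],
        Integer = as.integer(parts[i]),
        Float   = as.numeric(parts[i]),
        parts[i]
      )
    })
    names(cols) <- fields
    as.data.frame(cols, stringsAsFactors = FALSE)
  })
}

fields <- c("Allele", "Consequence", "IMPACT", "SYMBOL", "AF")
types  <- c("String", "String", "String", "String", "Float")
csq    <- "A|missense_variant|MODERATE|BRCA1|0.001,A|intron_variant|MODIFIER|BRCA1|"

records <- parse_annotation(csq, fields, types)
print(records[[1]])
print(records[[2]])
```

The second entry ends with an empty AF field, which the sketch returns as a numeric `NA` rather than an empty string.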
## Not run:
# Get a CSQ value from a VCF
csq <- "A|missense_variant|MODERATE|BRCA1|..."
result <- vep_parse_record(csq, "annotated.vcf.gz")
result[[1]]$Consequence
# "missense_variant"
result[[1]]$AF
# 0.001 (numeric)
## End(Not run)