- `parquet_to_vcf()` - Convert Parquet files back to VCF/VCF.GZ/BCF format:
  - `index` parameter
- `vcf_to_parquet_duckdb()` now embeds the full VCF header as Parquet key-value metadata by default:
  - `include_metadata = TRUE` (default) stores the complete VCF header in the Parquet file
  - `tidy_format` flag indicating data layout (`"true"` or `"false"`)
  - `parquet_kv_metadata(file)` to read the header back from Parquet
  - `partition_by` (Parquet limitation for partitioned writes)
- New helper functions:
  - `vcf_header_metadata(file)` - Extract full VCF header and package version
  - `parquet_kv_metadata(file)` - Read key-value metadata from Parquet files
- **`vcf_open_duckdb()`**: Open VCF/BCF files as DuckDB tables or views
  - `as_view = TRUE` (default) creates instant views that re-read VCF on each query
  - `as_view = FALSE` materializes data to a table for fast repeated queries
  - `tidy_format = TRUE` for one row per variant-sample with `SAMPLE_ID` column
  - `columns` parameter for selecting specific columns
  - `threads` parameter for parallel loading (requires indexed VCF)
  - `partition_by` for creating partitioned tables
  - `vcf_duckdb` object with connection, table name, and metadata
  - `vcf_close_duckdb()` for proper cleanup
- C-level `tidy_format` parameter: The DuckDB `bcf_reader` extension now supports native tidy format output directly at the C level, emitting one row per variant-sample combination with a `SAMPLE_ID` column
  - `tidy_format = TRUE` parameter
- Updated R wrapper functions with `tidy_format` parameter:
  - `vcf_query_duckdb(..., tidy_format = TRUE)` - query in tidy format
  - `vcf_count_duckdb(..., tidy_format = TRUE)` - count variant-sample rows
  - `vcf_schema_duckdb(..., tidy_format = TRUE)` - show tidy schema
  - `vcf_to_parquet_duckdb(..., tidy_format = TRUE)` - export in tidy format
  - `vcf_to_parquet_duckdb_parallel(..., tidy_format = TRUE)` - parallel tidy export
  - `ducklake_load_vcf(..., tidy_format = TRUE)` - load VCF in tidy format to DuckLake
- Removed SQL-based tidy functions (replaced by native `tidy_format` parameter):
  - `vcf_to_parquet_tidy()`
  - `vcf_to_parquet_tidy_parallel()`
  - `build_tidy_sql()` helper
- `partition_by` parameter for efficient per-sample queries on large cohorts:
  - `vcf_to_parquet_duckdb(..., partition_by = "SAMPLE_ID")` - create Hive-partitioned directory
  - `vcf_to_parquet_duckdb_parallel(..., partition_by = "SAMPLE_ID")` - parallel partitioned export
  - `ducklake_load_vcf(..., partition_by = "SAMPLE_ID")` - load partitioned VCF to DuckLake
  - `output_dir/SAMPLE_ID=HG00098/data_0.parquet`
  - `partition_by = c("CHROM", "SAMPLE_ID")`
- `allow_evolution` parameter for `ducklake_load_vcf()` and `ducklake_register_parquet()` to auto-add new columns via `ALTER TABLE`
- `ducklake_snapshots()`: list snapshot history
- `ducklake_current_snapshot()`: get current snapshot ID
- `ducklake_set_commit_message()`: set author/message for transactions
- `ducklake_options()`: get DuckLake configuration
- `ducklake_set_option()`: set compression, row group size, etc.
- `ducklake_query_snapshot()`: time travel queries at specific versions
- `ducklake_list_files()`: list Parquet files managed by DuckLake
- `ducklake_merge()`: upsert data using `MERGE INTO` syntax
- `vcf_query` to `vcf_query_arrow` and `vcf_to_parquet` to `vcf_to_parquet`
- bcftools score plugin
- `int64_t` format specifier in `bcf_reader` extension for macOS arm64 compatibility (use `PRId64` from `<inttypes.h>` instead of `%ld`)
- `DYLD_LIBRARY_PATH` in subprocesses
- `vcf_query` to `vcf_query_arrow` and `vcf_to_parquet` to `vcf_to_parquet`
- DuckLake catalog connection abstraction: Support for DuckDB, SQLite, PostgreSQL, MySQL backends
  - `ducklake_connect_catalog()`: Abstracted connection function for multiple catalog backends
  - `ducklake_create_catalog_secret()`: Create catalog secrets for credential management
  - `ducklake_list_secrets()`: List existing catalog secrets
  - `ducklake_drop_secret()`: Remove catalog secrets
  - `ducklake_update_secret()`: Update existing catalog secrets
  - `ducklake_parse_connection_string()`: Parse DuckLake connection strings
- DuckDB `bcf_reader` extension: Native DuckDB table function for querying VCF/BCF files directly.
  - `bcf_reader_build()`: Build extension from source using package's bundled htslib
  - `vcf_duckdb_connect()`: Create DuckDB connection with extension loaded
  - `vcf_query_duckdb()`: Query VCF/BCF files with SQL
- DuckDB `bcf_reader` extension now auto-parses VEP-style annotations (`INFO/CSQ`, `INFO/BCSQ`, `INFO/ANN`) into typed `VEP_*` columns with all transcripts preserved as lists (using a vendored parser); builds remain self-contained with packaged htslib.
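
To illustrate what "all transcripts preserved as lists" means for a VEP-style annotation, here is a minimal base R sketch. It is not the vendored parser: `parse_csq()`, the three field names, and the example string are all hypothetical, and it only assumes the usual CSQ layout (transcripts separated by commas, fields separated by `|`).

```r
# Sketch only: split a VEP-style CSQ value into per-field lists that keep
# every transcript. Field names normally come from the "Format:" part of the
# INFO/CSQ header description; the three used here are an illustrative subset.
parse_csq <- function(csq, fields) {
  transcripts <- strsplit(csq, ",", fixed = TRUE)[[1]]  # one entry per transcript
  records <- lapply(transcripts, function(tr) {
    values <- strsplit(tr, "|", fixed = TRUE)[[1]]
    # strsplit() drops a trailing empty field, so pad to the expected width
    length(values) <- length(fields)
    values[is.na(values)] <- ""
    stats::setNames(values, fields)
  })
  # One element per field, each holding the values for all transcripts
  lapply(stats::setNames(fields, fields), function(f) {
    vapply(records, function(r) r[[f]], character(1))
  })
}

fields <- c("Allele", "Consequence", "SYMBOL")        # hypothetical subset
csq <- "T|missense_variant|BRCA2,T|intron_variant|"   # made-up example value
vep <- parse_csq(csq, fields)
vep$Consequence
```

Each element of `vep` plays the role of one `VEP_*` column for a single variant; typing (for example converting numeric fields) is left out of this sketch.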
- Arrow VCF stream (nanoarrow) now aligns VEP parsing semantics with DuckDB (schema and typing improvements; transcript handling under active development).
- Parallel (contig-based) DuckDB extension Parquet converter.
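
The contig-based pattern can be pictured as a fan-out over contigs followed by a combine step. The base R sketch below only shows that shape and is not the package's implementation: `convert_contig()`, the contig names, and the output naming are hypothetical stand-ins.

```r
# Sketch only: per-contig fan-out, then combine. convert_contig() is a
# hypothetical stand-in for the real per-region conversion step.
convert_contig <- function(contig, out_dir) {
  file.path(out_dir, paste0(contig, ".parquet"))  # path a worker would write
}

contigs <- c("chr1", "chr2", "chrX")  # normally taken from the VCF index
out_dir <- tempdir()

# mclapply() forks on Unix-alikes; on Windows only mc.cores = 1 is supported
cores <- if (.Platform$OS.type == "windows") 1L else 2L
parts <- parallel::mclapply(contigs, convert_contig, out_dir = out_dir,
                            mc.cores = cores)

# Results come back in contig order, ready to be merged or listed
part_files <- unlist(parts)
basename(part_files)
```

Splitting by contig is why an indexed input matters: each worker needs random access to its own region.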
- Package version reflects bundled htslib/bcftools versions.
- VCF to Parquet conversion now supports parallel, thread-based conversion
- `vcf2parquet.R` script in `inst/`
- VCF to Arrow streaming via nanoarrow (no arrow package required):
  - `vcf_open_arrow()`: Open VCF/BCF as Arrow array stream
  - `vcf_to_arrow()`: Convert to data.frame/tibble/batches
  - `vcf_to_parquet()`: Export to Parquet format via DuckDB
  - `vcf_to_arrow_ipc()`: Export to Arrow IPC format (streaming, no memory overhead)
  - `vcf_query()`: SQL queries on VCF files via DuckDB
- Streaming mode for large files: `vcf_to_parquet(..., streaming = TRUE)` streams VCF -> Arrow IPC -> Parquet without loading into R memory. Requires DuckDB nanoarrow extension (auto-installed on first use).
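
A hedged usage sketch for streaming mode follows. Only `streaming = TRUE` is taken from the entry above; the positional input and output arguments, the file paths, and the `stream_if_available()` wrapper are assumptions made for illustration, so check the function's help page for the real signature.

```r
# Sketch only: the positional input/output arguments and both file paths are
# assumptions; only `streaming = TRUE` comes from the changelog entry.
stream_if_available <- function(input_vcf, output_parquet) {
  # Skip quietly when the package is not attached or the input is missing
  if (!exists("vcf_to_parquet", mode = "function") || !file.exists(input_vcf)) {
    return(FALSE)
  }
  vcf_to_parquet(input_vcf, output_parquet, streaming = TRUE)
  TRUE
}

# Hypothetical paths; returns FALSE unless the package and input file exist
stream_if_available("cohort.vcf.gz", "cohort.parquet")
```

The guard keeps the snippet harmless to run outside a session where the package is loaded.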
- INFO and FORMAT field extraction:
  - `INFO` data.frame column
  - `samples` data.frame with sample names as columns
- `bcf_hdr_check_sanity()`
- Bundles htslib/bcftools CLI and libraries