Parquet in Rust: Reading the Schema
Following up on my previous post about reading Parquet files in Rust, I spent some time looking through the parquet
crate’s documentation to work out how to get the schema. Once I distilled it down, it’s actually a lot simpler than I expected:
// `extern crate` is only required on the 2015 edition; it is harmless on later ones
extern crate parquet;

use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};
use parquet::schema::types::Type;

fn main() -> Result<(), std::io::Error> {
    let filename = "my_dataset.parquet";
    let fh = File::open(filename)?;
    let reader: SerializedFileReader<File> = SerializedFileReader::new(fh)?;
    let schema: &Type = reader.metadata().file_metadata().schema();

    // recursively display the schema (a group type holds a list of child fields)
    display(schema, 0);
    Ok(())
}

fn display(schema: &Type, depth: usize) {
    let name = schema.name();
    // four spaces of indentation per nesting level
    let indent = " ".repeat(4 * depth);
    match schema {
        Type::PrimitiveType { physical_type, .. } => {
            println!("{}{} : {}", indent, name, physical_type)
        }
        Type::GroupType { .. } => println!("{}{} is a list type", indent, name),
    }

    // a group type holds a list of child fields, so recurse into each one
    if schema.is_group() {
        for column in schema.get_fields() {
            display(column, depth + 1);
        }
    }
}
Running this code generates the following output for the sample dataset:
schema is a list type
    name : BYTE_ARRAY
    age : INT64
    height : DOUBLE
    languages is a list type
        list is a list type
            item : BYTE_ARRAY
    is_employed : BOOLEAN
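
The nesting comes entirely from the recursion: each level of depth adds four spaces of indentation. To see that without needing a Parquet file, here is a minimal standard-library sketch. The `Node` enum, `render`, `render_into` and `sample_schema` are hypothetical names I made up for illustration; `Node` is only a stand-in for the crate's `Type`, not part of the parquet API. It collects the lines into a `Vec<String>` instead of printing them, which makes the traversal easy to check.

```rust
/// Hypothetical stand-in for the parquet crate's schema `Type`.
/// (Parquet itself calls the non-primitive case a "group".)
enum Node {
    Primitive { name: &'static str, physical_type: &'static str },
    Group { name: &'static str, fields: Vec<Node> },
}

/// Render the tree as one line per node, indented four spaces per level.
fn render(root: &Node) -> Vec<String> {
    let mut lines = Vec::new();
    render_into(root, 0, &mut lines);
    lines
}

fn render_into(node: &Node, depth: usize, lines: &mut Vec<String>) {
    let indent = " ".repeat(4 * depth);
    match node {
        Node::Primitive { name, physical_type } => {
            lines.push(format!("{}{} : {}", indent, name, physical_type));
        }
        Node::Group { name, fields } => {
            lines.push(format!("{}{} is a list type", indent, name));
            // Children sit one level deeper than their parent.
            for field in fields {
                render_into(field, depth + 1, lines);
            }
        }
    }
}

/// A cut-down, hand-built version of the sample dataset's schema.
fn sample_schema() -> Node {
    Node::Group {
        name: "schema",
        fields: vec![
            Node::Primitive { name: "name", physical_type: "BYTE_ARRAY" },
            Node::Group {
                name: "languages",
                fields: vec![Node::Group {
                    name: "list",
                    fields: vec![Node::Primitive {
                        name: "item",
                        physical_type: "BYTE_ARRAY",
                    }],
                }],
            },
        ],
    }
}
```

Calling `render(&sample_schema())` yields five lines, with `item` indented three levels deep, which matches the shape of the `languages` column in the real output.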