Parquet in Rust
One of my projects at work has lead me to dig into processing some large data in Parquet format, so I spent some time figuring out how to do so in Rust using parquet-rs. Below is a simple example of how to create data in Python and read it in with Rust. My current use case has little more than very simple Parquet – no concern with indexes, partitioning, etc.; just simple row-wise processing of compressed, structured data.
Create Dataset
First, I created a simple dataset in Python+Pandas:
import pandas as pd
people = { 'name': ['Alice', 'Bob', 'Charlie', 'Dave'],
'age': [35, 32, 41, 27],
'height': [ 178.6, 175.3, 177, 175 ],
'languages': [['English', 'French'], ['English'], [], ['Python', 'Rust']],
'is_employed': [True, True, False, True],
}
df = pd.DataFrame(people)
df.to_parquet('my_dataset.parquet')
This dataset looks like this from Python:
>>> df
name age height languages is_employed
0 Alice 35 178.6 [English, French] True
1 Bob 32 175.3 [English] True
2 Charlie 41 177.0 [] False
3 Dave 27 175.0 [Python, Rust] True
Setup Rust Project
Create a new Rust project with cargo new parquet_test
. parquet-rs takes a dependency on nightly, so specify the override: rustup override set nightly
.
Add to your Cargo.toml file a dependency of parquet = "0.16"
.
Rust code
use std::fs::File;
extern crate parquet;
use parquet::file::reader::SerializedFileReader;
use parquet::record::{ListAccessor, RowAccessor};
fn main() {
let filename = "my_dataset.parquet";
let fh = File::open(filename).unwrap();
let reader: SerializedFileReader<File> = SerializedFileReader::new(fh).unwrap();
let mut lines = 0;
for row in reader.into_iter() {
lines += 1;
println!("Row {}", lines);
let name = row.get_string(0).unwrap();
let age = row.get_long(1).unwrap();
// Appropriate handling of null input values:
let height = if let Ok(height_val) = row.get_double(2) {
height_val
} else {
-9999.
};
let is_employed = row.get_bool(4).unwrap();
println!(
" Name={}, age={}, height={}, is_employed={}",
name, age, height, is_employed
);
let languages: &parquet::record::List = row.get_list(3).unwrap();
if languages.len() == 0 {
println!("Languages: (none)");
} else {
let joined_langs = (0..languages.len())
.map(|i| languages.get_string(i).unwrap().to_string())
.collect::<Vec<_>>()
.join(", ");
println!(" Languages: {}", joined_langs);
}
println!("---");
}
println!("Read {} records from {}", lines, filename);
}