
I'm building a database on top of io_uring and the NVMe API. I need a place to store seldom-used, append-only records (older parts of message queues, columnar tables that have already been aggregated, old WAL blocks kept for potential restores, ...), and I was thinking of adding HDDs to the storage pool to save money.

The server I'm experimenting on is: bare metal, a very modern Linux kernel (needed for io_uring), 128 GB RAM, 24 threads, 2× 2 TB NVMe, 14× 22 TB SATA HDD.

At the moment my approach is:

  • No filesystem: use Direct I/O on the block device (a sketch of this write path follows the list)
  • Store metadata in RAM for fast lookup
  • Use NVMe to persist metadata and act as a writeback cache
  • Use 16 MB block size
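
To make that concrete, here is a minimal sketch of the kind of write path I mean: open a raw block device with O_DIRECT and push one aligned 16 MB write through io_uring via liburing. The device path, queue depth and 4 KB alignment are placeholders, not my actual code, and it overwrites whatever is at offset 0 of the target device.

```c
/* Minimal sketch of the write path: one aligned 16 MB write to a raw block
 * device through io_uring (liburing). Device path, queue depth and the 4 KB
 * alignment are placeholders. WARNING: this overwrites data at offset 0. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE (16UL * 1024 * 1024)   /* 16 MB extent, as above */
#define QUEUE_DEPTH 64

int main(void)
{
    /* O_DIRECT bypasses the page cache; buffer, length and offset must all be
     * aligned to the device's logical block size (4096 covers the usual cases). */
    int fd = open("/dev/sdX", O_RDWR | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, BLOCK_SIZE)) { perror("posix_memalign"); return 1; }
    memset(buf, 0xAB, BLOCK_SIZE);

    struct io_uring ring;
    if (io_uring_queue_init(QUEUE_DEPTH, &ring, 0) < 0) { perror("queue_init"); return 1; }

    /* Queue one large linear write at offset 0; the real engine would keep
     * many of these in flight per spindle. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_write(sqe, fd, buf, BLOCK_SIZE, 0);
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    if (cqe->res < 0)
        fprintf(stderr, "write failed: %s\n", strerror(-cqe->res));
    else
        printf("wrote %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    free(buf);
    close(fd);
    return 0;
}
```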

It honestly looks really effective:

  • The NVMe cache lets me saturate the 50 Gbps downlink without problems, unlike the current Linux caching solutions (bcache, LVM cache, ...)
  • By the time data touches the HDDs it has already been compacted, so the HDDs see nothing but large linear writes and reads
  • I get the REAL read benefit of RAID1, since I can stripe read access across drives (or nodes); see the read-striping sketch below
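
Here is what I mean by striping reads across a RAID1 pair: both drives hold identical data, so each extent read can be sent to either one, e.g. round-robin. Again, the device paths, extent size and the trivial round-robin policy are placeholders for illustration.

```c
/* Sketch of application-level read striping across a RAID1 mirror pair:
 * both devices hold identical data, so each extent read can go to either.
 * Paths, extent size and the round-robin policy are placeholders. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define EXTENT (16UL * 1024 * 1024)

static int mirror_fd[2];       /* two HDDs holding identical copies */
static unsigned next_mirror;   /* round-robin cursor */

/* Queue a read of one extent, alternating mirrors so sequential scans keep
 * both spindles busy instead of one. */
static int queue_extent_read(struct io_uring *ring, void *buf, off_t offset)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    if (!sqe)
        return -1;
    int fd = mirror_fd[next_mirror++ % 2];
    io_uring_prep_read(sqe, fd, buf, EXTENT, offset);
    return 0;
}

int main(void)
{
    mirror_fd[0] = open("/dev/sdX", O_RDONLY | O_DIRECT);
    mirror_fd[1] = open("/dev/sdY", O_RDONLY | O_DIRECT);
    if (mirror_fd[0] < 0 || mirror_fd[1] < 0) { perror("open"); return 1; }

    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0) { perror("queue_init"); return 1; }

    void *buf[2];
    for (int i = 0; i < 2; i++)
        if (posix_memalign(&buf[i], 4096, EXTENT)) { perror("posix_memalign"); return 1; }

    /* Two consecutive extents: each lands on a different mirror. */
    queue_extent_read(&ring, buf[0], 0);
    queue_extent_read(&ring, buf[1], EXTENT);
    io_uring_submit(&ring);

    for (int i = 0; i < 2; i++) {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        if (cqe->res < 0)
            fprintf(stderr, "read failed: %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    close(mirror_fd[0]);
    close(mirror_fd[1]);
    return 0;
}
```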

Anyhow, while I know the NVMe spec to the core, I'm unfamiliar with using HDDs as plain block devices without a FS. My questions are:

  • Are there any pitfalls I'm not considering?
  • Is there a reason why I should prefer using an FS for my use case?
  • Just for my curiosity: "I'm building a database on top of io_uring and the NVMe API." Is this in competition with, or complementary to, PostgreSQL's work in that direction? (I hope you're aware of their results there, which aren't too great for io_uring, partly because a database payload is, to some degree, inherently synchronous in a lot of ways due to consistency requirements.) Commented Dec 8 at 11:53
  • (Note that Postgres, as far as I know, is based on file I/O, not raw block devices, because their benchmarks say that very few things are better at offering "different pieces of data in a list" abstractions than actual files on an actual file system.) Commented Dec 8 at 11:55
  • pganalyze.com/blog/postgres-18-async-io Commented Dec 8 at 11:58
  • @MarcusMüller I would say it's a competition. To briefly sum up a topic I could talk about for hours: PSQL is an old design onto which io_uring has been bolted, whereas I take a RADICALLY different approach and design from the ground up with modern hardware and software in mind. Commented Dec 8 at 13:05
  • @MarcusMüller Yeah, if you use O_DIRECT|O_SYNC for file I/O then you're bypassing caches and getting a sync after each write, so you get roughly the same performance as direct raw block writes (just with a level of indirection that's normally all in memory). Things get even more complicated if your backend device is RAID or LVM. Back in the 90s, Oracle DBAs would tune their datafiles and tablespaces to be split across devices to help maximize I/O; today the OS does a much better job. And when you're talking about VMs writing to a SAN, all "local optimisations" are just off the books. Commented Dec 9 at 2:31
