
I'm building a database on top of io_uring and the NVMe API. I need a place to store seldom-used, append-only records (older parts of message queues, columnar tables that have already been aggregated, old WAL blocks kept for potential restores, ...), and I was thinking of adding HDDs to the storage pool to save money.

The server I'm experimenting on is: bare metal, a very modern Linux kernel (needed for io_uring), 128 GB RAM, 24 threads, 2× 2 TB NVMe, 14× 22 TB SATA HDD.

At the moment my approach is:

  • No filesystem: use Direct I/O on the block device (a sketch of this write path follows the list)
  • Store metadata in RAM for fast lookup
  • Use NVMe to persist metadata and act as a writeback cache
  • Use 16 MB block size
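
To make that concrete, here is a minimal sketch of the kind of write path I mean: open a raw block device with O_DIRECT and push one aligned 16 MB write through io_uring via liburing. The device path, queue depth and 4 KB alignment are placeholders, not my actual code, and it overwrites whatever is at offset 0 of the target device.

```c
/* Minimal sketch of the write path: one aligned 16 MB write to a raw block
 * device through io_uring (liburing). Device path, queue depth and the 4 KB
 * alignment are placeholders. WARNING: this overwrites data at offset 0. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE (16UL * 1024 * 1024)   /* 16 MB extent, as above */
#define QUEUE_DEPTH 64

int main(void)
{
    /* O_DIRECT bypasses the page cache; buffer, length and offset must all be
     * aligned to the device's logical block size (4096 covers the usual cases). */
    int fd = open("/dev/sdX", O_RDWR | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, BLOCK_SIZE)) { perror("posix_memalign"); return 1; }
    memset(buf, 0xAB, BLOCK_SIZE);

    struct io_uring ring;
    if (io_uring_queue_init(QUEUE_DEPTH, &ring, 0) < 0) { perror("queue_init"); return 1; }

    /* Queue one large linear write at offset 0; the real engine would keep
     * many of these in flight per spindle. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_write(sqe, fd, buf, BLOCK_SIZE, 0);
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    if (cqe->res < 0)
        fprintf(stderr, "write failed: %s\n", strerror(-cqe->res));
    else
        printf("wrote %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    free(buf);
    close(fd);
    return 0;
}
```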

It honestly looks really effective:

  • The NVMe cache lets me saturate the 50 Gbps downlink without problems, unlike the current Linux caching solutions (bcache, LVM cache, ...)
  • By the time data touches the HDDs it has already been compacted, so the HDDs see nothing but large linear writes and reads
  • I get the REAL read benefit of RAID1, since I can stripe read access across drives (or nodes); see the read-striping sketch below
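
Here is what I mean by striping reads across a RAID1 pair: both drives hold identical data, so each extent read can be sent to either one, e.g. round-robin. Again, the device paths, extent size and the trivial round-robin policy are placeholders for illustration.

```c
/* Sketch of application-level read striping across a RAID1 mirror pair:
 * both devices hold identical data, so each extent read can go to either.
 * Paths, extent size and the round-robin policy are placeholders. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define EXTENT (16UL * 1024 * 1024)

static int mirror_fd[2];       /* two HDDs holding identical copies */
static unsigned next_mirror;   /* round-robin cursor */

/* Queue a read of one extent, alternating mirrors so sequential scans keep
 * both spindles busy instead of one. */
static int queue_extent_read(struct io_uring *ring, void *buf, off_t offset)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    if (!sqe)
        return -1;
    int fd = mirror_fd[next_mirror++ % 2];
    io_uring_prep_read(sqe, fd, buf, EXTENT, offset);
    return 0;
}

int main(void)
{
    mirror_fd[0] = open("/dev/sdX", O_RDONLY | O_DIRECT);
    mirror_fd[1] = open("/dev/sdY", O_RDONLY | O_DIRECT);
    if (mirror_fd[0] < 0 || mirror_fd[1] < 0) { perror("open"); return 1; }

    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0) { perror("queue_init"); return 1; }

    void *buf[2];
    for (int i = 0; i < 2; i++)
        if (posix_memalign(&buf[i], 4096, EXTENT)) { perror("posix_memalign"); return 1; }

    /* Two consecutive extents: each lands on a different mirror. */
    queue_extent_read(&ring, buf[0], 0);
    queue_extent_read(&ring, buf[1], EXTENT);
    io_uring_submit(&ring);

    for (int i = 0; i < 2; i++) {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        if (cqe->res < 0)
            fprintf(stderr, "read failed: %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    close(mirror_fd[0]);
    close(mirror_fd[1]);
    return 0;
}
```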

Anyhow, while I know the NVMe spec to the core, I'm unfamiliar with using HDDs as plain block devices without a FS. My questions are:

  • Are there any pitfalls I'm not considering?
  • Is there a reason why I should prefer using an FS for my use case?
  • Just for my curiosity: "I'm building a database on top of io_uring and the NVMe API." Is this in competition with, or complementary to, PostgreSQL's work in that direction? (I hope you're aware of their results there, which aren't too great for io_uring, partly because a database payload is, to some degree, inherently synchronous in a lot of ways due to consistency requirements.) Commented Dec 8 at 11:53
  • (Note that Postgres, as far as I know, is based on file I/O, not raw block devices, because their benchmarks say that very few things are better at offering "different pieces of data in a list" abstractions than actual files on an actual file system.) Commented Dec 8 at 11:55
  • pganalyze.com/blog/postgres-18-async-io Commented Dec 8 at 11:58
  • @MarcusMüller I would say it's a competition. To briefly sum up a topic I could talk about for hours: PSQL is an old design onto which io_uring has been bolted, whereas I take a RADICALLY different approach and design from the ground up with modern hardware and software in mind. Commented Dec 8 at 13:05
  • @MarcusMüller Yeah, if you use O_DIRECT|O_SYNC for file I/O then you're bypassing caches and getting a sync after each write, so you get roughly the same performance as direct raw block writes (just with a level of indirection that's normally all in memory). Things get even more complicated if your backend device is RAID or LVM. Back in the 90s, Oracle DBAs would tune their datafiles and tablespaces to be split across devices to help maximize I/O; today the OS does a much better job. And when you're talking about VMs writing to a SAN, all "local optimisations" are just off the books. Commented Dec 9 at 2:31
