A faster way to copy SQLite databases between computers

200 comments

May 1, 2025

bambax

> If it takes a long time to copy a database and it gets updated midway through, rsync may give me an invalid database file. The first half of the file is pre-update, the second half is post-update, and they don’t match. When I try to open the database locally, I get an error.

Of course! You can't copy the file of a running, active db receiving updates, that can only result in corruption.

For replicating sqlite databases safely there is

https://github.com/benbjohnson/litestream

rsync

"For replicating sqlite databases safely there is (litestream) ..."

A reminder that litestream can run over plain old SFTP[1] which means you can stream database replication to just about any UNIX endpoint over SSH.

I have a favorite[2] but any SFTP server will do ...

[1] https://github.com/benbjohnson/litestream/issues/140

[2] https://www.rsync.net/resources/notes/2021-q3-rsync.net_tech...

aitchnyu

What's the closest counterpart in the Postgres ecosystem?

j16sdiz

log shipping[1] or logical replication[2]

[1] https://www.postgresql.org/docs/current/warm-standby.html

[2] https://www.postgresql.org/docs/current/logical-replication....

rsync

This isn't what you mean but I can't resist:

  pg_dump -U postgres db | ssh user@rsync.net "dd of=db_dump"

creatonez

> You can't copy the file of a running, active db receiving updates, that can only result in corruption

To push back against "only" -- there is actually one scenario where this works. Copying a file or a subvolume on Btrfs or ZFS can be done atomically, so if it's an ACID database or an LSM tree, in the worst case it will just rollback. Of course, if it's multiple files you have to take care to wrap them in a subvolume so that all of them are copied in the same transaction, simply using `cp --reflink=always` won't do.

Possibly freezing the process with SIGSTOP would yield the same result, but I wouldn't count on that

lmz

It can't be done without fs specific snapshots - otherwise how would it distinguish between a cp/rsync needing consistent reads vs another sqlite client wanting the newest data?

ummonk

I would assume cp uses ioctl (with atomic copies of individual files on filesystems that support CoW like APFS and BTRFS), whereas sqlite probably uses mmap?

o11c

Obligatory "LVM still exists and snapshots are easy enough to overprovision for"

wswope

The built-in .backup command is also intended as an official tool for making “snapshotted” versions of a live db that can be copied around.

remram

This leverages a dedicated on-line backup API: https://sqlite.org/backup.html
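For anyone who wants to drive the same online backup API from a script rather than the CLI, here is a minimal sketch using Python's built-in `sqlite3` module, whose `Connection.backup` method wraps that API. The function name `snapshot` and the chunk size are illustrative assumptions, not anything SQLite prescribes.

```python
import sqlite3


def snapshot(src_path: str, dest_path: str, pages_per_step: int = 1024) -> None:
    """Copy a (possibly live) database to dest_path via the online backup API."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        # Copying in chunks of pages lets other connections keep working
        # between steps instead of holding a read lock for the whole copy.
        src.backup(dest, pages=pages_per_step)
    finally:
        dest.close()
        src.close()
```

The resulting file is an ordinary, consistent SQLite database that can then be copied with rsync or scp without racing against writers.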

lknuth

While I run and love litestream on my own system, I also like that they have a pretty comprehensive guide on how to do something like this manually, via built-in tools: https://litestream.io/alternatives/cron/

yard2010

Litestream is really cool! I'm planning to use it to backup and restore my SQLite in the container level, just like what that ex-google guy who started a startup of a small KVM and had a flood in his warehouse while on vacation did. If I'm not mistaken. I would link here the perfect guide he wrote but there's 0 chance I'll find it. If you understand the reference please post the link.

mtlynch

Haha, that sounds like me. Here's the writeup you're talking about:

https://mtlynch.io/litestream/

And here's the flooding story:

https://mtlynch.io/solo-developer-year-6/#the-most-terrifyin...

Sidenote: I still use Litestream in every project where I use SQLite.

yellow_lead

Litestream looks interesting but they are still in beta, and seem to have not had a release in over a year, although SQLite doesn't move that quickly.

Is Litestream still an active project?

clintonb

Despite the beta label and lack of a 1.x release, I would consider the project pretty stable. We've used it in production for over 18 months to support an offline-first point of sale system. We haven't had any issues with Litestream.

yellow_lead

That's great, thanks for sharing!

acrispino

seems like a new release is being worked on: https://github.com/benbjohnson/litestream/pull/636

pixl97

>You can't copy the file of a running, active db receiving updates, that can only result in corruption

There is a slight 'well akshully' on this. A DB flush and FS snapshot where you copy the snapshotted file will allow this. MSSQL VSS snapshots would be an example of this.

tpmoney

Similarly you can rsync a Postgres data directory safely while the db is running, with the caveat that you likely lose any data written while the rsync is running. And if you want that data, you can get it with the WAL files.

It’s been years since I needed to do this, but if I remember right, you can clone an entire pg db live with a `pg_backup_start()`, rsync the data directory, pg_backup_stop() and rsync the WAL files written since backup start.

edoceo

For moving DBs where I'm allowed minutes of downtime I do rsync (slow) first from the live, while hot, then just stop that one, then rsync again (fast) then make the new one hot.

Works a treat when other (better) methods are not available.

bob1029

MSSQL also offers a virtual backup device interface that 3rd party tools can implement.

https://learn.microsoft.com/en-us/sql/relational-databases/b...

gerdesj

"If it takes a long time to copy a database and it gets updated midway through, rsync may give me an invalid database file"

Wot? There are multiple ways of snapshotting/checkpointing, starting at the virty level and working on down the stack through the application level.

zeroq

How to copy databases between computers? Just send a circle and forget about the rest of the owl.

As others have mentioned, an incremental rsync would be much faster, but what bothers me the most is that he claims that sending SQL statements is faster than sending the database while COMPLETELY omitting the fact that you have to execute these statements. And then run /optimize/. And then run /vacuum/.

Currently I have a scenario in which I have to "incrementally rebuild *" a database from CSV files. While in my particular case recreating the database from scratch is more optimal - despite heavy optimization it still takes half an hour just to run batch inserts on an empty database in memory, creating indexes, etc.

iveqy

I hope you've found https://stackoverflow.com/questions/1711631/improve-insert-p...

It's a very good writeup on how to do fast inserts in sqlite3

zeroq

Yes! That was actually quite helpful.

For my use case (recreating in-memory from scratch) it basically boils down to three points: (1) journal_mode = off (2) wrapping all inserts in a single transaction (3) indexes after inserts.

For whatever it's worth I'm getting 15M inserts per minute on average, and topping around 450k/s for trivial relationship table on a stock Ryzen 5900X using built-in sqlite from NodeJS.
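The three points above can be sketched in a few lines with Python's built-in `sqlite3` module. This is only an illustration under assumptions: the `edge` table, its columns, and the `bulk_load` name are made up for the example, and the throughput will differ from the NodeJS numbers quoted.

```python
import sqlite3


def bulk_load(rows):
    """Load (src, dst) pairs into a fresh in-memory database."""
    # isolation_level=None puts the connection in autocommit mode,
    # so the BEGIN/COMMIT below are the only transaction boundaries.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("PRAGMA journal_mode = OFF")            # (1) no rollback journal
    conn.execute("CREATE TABLE edge (src INTEGER, dst INTEGER)")
    conn.execute("BEGIN")                                 # (2) one transaction
    conn.executemany("INSERT INTO edge VALUES (?, ?)", rows)
    conn.execute("COMMIT")
    conn.execute("CREATE INDEX edge_src ON edge (src)")  # (3) index after inserts
    return conn
```

With `journal_mode = OFF` a failed transaction cannot be rolled back, which is acceptable here only because the database is rebuilt from scratch anyway.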

vlovich123

Would it be useful for you to have a SQL database that’s like SQLite (single file but not actually compatible with the SQLite file format) but can do 100M/s instead?

o11c

It's worth noting that the data in that benchmark is tiny (28MB). While this varies between database engines, "one transaction for everything" means keeping some kind of allocations alive.

The optimal transaction size is difficult to calculate so should be measured, but it's almost certainly never beneficial to spend multiple seconds on a single transaction.

There will also be weird performance changes when the size of data (or indexed data) exceeds the size of main memory.

jgalt212

yes, but they punt on this issue:

CREATE INDEX then INSERT vs. INSERT then CREATE INDEX

i.e. they only time INSERTs, not the CREATE INDEX after all the INSERTs.

gibibit

Hilarious, 3000+ votes for a Stack Overflow question that's not a question. But it is an interesting article. Interesting enough that it gets to break all the rules, I guess?

detaro

It's a (quite old) community wiki post. These do (and especially did back then) work and are treated differently.

stackskipton

As with any optimization, it matters where your bottleneck is. Sounds like theirs is bandwidth while CPU/disk I/O is plentiful, since they mentioned that downloading a 250MB database takes a minute, whereas I just grabbed a 2GB SQLite test database from a work server in 15 seconds thanks to 1Gbps fiber.

JamesonNetworks

30 minutes seems long. Is there a lot of data? I’ve been working on bootstrapping sqlite dbs off of lots of json data and by holding a list of values and then inserting 10k at a time with inserts, I’ve found a good perf sweet spot where I can insert plenty of rows (millions) in minutes. I had to use some tricks with bloom filters and LRU caching, but can build a 6 gig db in like 20ish minutes now.

zeroq

It's roughly 10Gb across several CSV files.

I create a new in-mem db, run schema and then import every table in one single transaction (in my testing it showed that it doesn't matter if it's a single batch or multiple single inserts as long as they are part of a single transaction).

I do a single string replacement per every CSV line to handle an edge case. This results in roughly 15 million inserts per minute (give or take, depending on table length and complexity). 450k inserts per second is a magic barrier I can't break.

I then run several queries to remove unwanted data, trim orphans, add indexes, and finally run optimize and vacuum.

Here's a quite recent log (on a stock Ryzen 5900X):

   08:43 import
   13:30 delete non-essentials
   18:52 delete orphans
   19:23 create indexes
   19:24 optimize
   20:26 vacuum

thechao

Millions of rows in minutes sounds not ok, unless your tables have a large number of columns. A good rule is that SQLite's insertion performance should be at least 1% of sustained max write bandwidth of your disk; preferably 5%, or more. The last bulk table insert I was seeing 20%+ sustained; that came to ~900k inserts/second for an 8 column INT table (small integers).

pessimizer

Saying that 30 minutes seems long is like saying that 5 miles seems far.

conradev

SQLite has an official tool for this, fwiw: https://www.sqlite.org/rsync.html

It works at the page level:

> The protocol is for the replica to send a cryptographic hash of each of its pages over to the origin side, then the origin sends back the complete content of any page for which the hash does not match.
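The page-level exchange described in that quote can be illustrated with a toy sketch. This is not the sqlite3_rsync implementation: SHA-256, the fixed 4096-byte page size, and the function names are stand-in assumptions chosen only to show the shape of the protocol.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size for the illustration


def page_hashes(data: bytes, page_size: int = PAGE_SIZE) -> list[bytes]:
    """Replica side: one hash per page of its current file contents."""
    return [
        hashlib.sha256(data[i:i + page_size]).digest()
        for i in range(0, len(data), page_size)
    ]


def pages_to_send(origin: bytes, replica_hashes: list[bytes],
                  page_size: int = PAGE_SIZE) -> dict[int, bytes]:
    """Origin side: full content of every page whose hash differs or is missing."""
    changed = {}
    for number, start in enumerate(range(0, len(origin), page_size)):
        page = origin[start:start + page_size]
        if (number >= len(replica_hashes)
                or hashlib.sha256(page).digest() != replica_hashes[number]):
            changed[number] = page
    return changed
```

The replica would then overwrite exactly the returned page numbers, so the bytes on the wire scale with the number of changed pages rather than the file size.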

CBLT

Yeah, but unfortunately the SQLite team doesn't include that tool with their "autotools" tarball, which is what most distros (and brew) use to package SQLite. The only way to use the tool is to compile it yourself.

conradev

Yeah, that’s a bummer. It does appear to be in nixpkgs, though:

  nix-shell -p sqlite-rsync

dgfitz

Realistically, are you using SQLite if you can’t compile and source control your rev of the codebase? Is that really a big deal?

rcxdude

Yes, it's extremely common to be using it and not even be compiling anything yourself, let alone C or any support libraries.

CBLT

`sqlite3_rsync` must be installed on the remote host too, so now you're cross-compiling for all the hosts you manage. It also must be installed into the PATH the ssh uses, which for a number of operating systems doesn't include /usr/local/bin. So I guess you're now placing your sshd config under configuration management to allow that.

These tasks aren't that challenging but they sure are a yak shave.

hundredwatt

The recently released sqlite3_rsync utility uses a version of the rsync algorithm optimized to work on the internal structure of a SQLite database. It compares the internal data pages efficiently, then only syncs changed or missing pages.

Nice tricks in the article, but you can more easily use the builtin utility now :)

I blogged about how it works in detail here: https://nochlin.com/blog/how-the-new-sqlite3_rsync-utility-w...

rsync

Also note:

sqlite3_rsync is now built into the rsync.net platform.

  ssh user@rsync.net sqlite3_rsync … blah blah …

… just added last week and not rolled out in all regions but … all initial users reported it worked exactly as they expected it to.

jgalt212

sqlite3_rsync can only be used in WAL mode. A further constraint of WAL mode is that the database file must be stored on local disk. Clearly, you'd want to do this almost all the time, but for the times this is not possible this utility won't work.
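For reference, checking or switching a database to WAL mode is a single pragma. A minimal sketch with Python's built-in `sqlite3` module follows; the helper name `ensure_wal` is an assumption made for the example.

```python
import sqlite3


def ensure_wal(path: str) -> str:
    """Ask SQLite to switch the database at path to WAL and report the result."""
    conn = sqlite3.connect(path)
    try:
        # The pragma returns the journal mode in effect afterwards, which
        # may not be "wal" if the switch was not possible.
        return conn.execute("PRAGMA journal_mode = WAL").fetchone()[0]
    finally:
        conn.close()
```

The setting is persistent: it is stored in the database file, so later connections see WAL mode without issuing the pragma again.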

SQLite

I just checked in an experimental change to sqlite3_rsync that allows it to work on non-WAL-mode database files, as long as you do not use the --wal-only command-line option. The downside of this is that the origin database will block all writers while the sync is going on, and the replica database will block both readers and writers during the sync, because to do otherwise requires WAL-mode. Nevertheless, being able to sync DELETE-mode databases might well be useful, as you observe.

If you are able, please try out this enhancement and let me know if it solves your problem. See <https://sqlite.org/src/info/2025-05-01T16:07Z> for the patch.

SQLite

Update: This enhancement is now on trunk and will be included in the 3.50.0 release of SQLite due out in about four weeks.

remram

WAL mode works on many network filesystems provided it's being written from a single host at a time.

jgalt212

I'm not sure I understand your comment. Regardless of WAL or network filesystem usage, the sqlite file cannot be written to from multiple processes simultaneously. Am I missing something here, or did you misstate?

construct0

Demands increasing page size if you sync frequently (bandwidth).

mromanuk

I was surprised that he didn't try to use the on-the-fly compression provided by rsync:

  -z, --compress              compress file data during the transfer
      --compress-level=NUM    explicitly set compression level

Probably it's faster to compress to gzip and later transfer. But it's nice to have the possibility to improve the transfer with a flag.

jddj

Or better yet, since they cite corruption issues, sqlite3_rsync (https://sqlite.org/rsync.html) with -z

sqlite transaction- and WAL-aware rsync with inflight compression.

crazygringo

The main point is to skip the indices, which you have to do pre-compression.

When I do stuff like this, I stream the dump straight into gzip. (You can usually figure out a way to stream directly to the destination without an intermediate file at all.)

Plus this way it stays stored compressed at its destination. If your purpose is backup rather than a poor man's replication.

schnable

The main point was decreasing the transfer time - if rsync -z makes it short enough, it doesn't matter if the indices are there or not, and you also skip the step of re-creating the DB from the text file.

crazygringo

The point of the article is that it does matter if the indices are there. And indices generally don't compress very well anyways. What compresses well are usually things like human-readable text fields or booleans/enums.

worldsavior

I believe compression is only good on slow speed networks.

PhilipRoman

It would have to be one really fast network... zstd compresses and decompresses at 5+ GB (bytes, not bits) per second.

o11c

I just tested on a ramdisk:

  tool  cspeed    size  dspeed
  zstd  361 MB/s  16%   1321 MB/s
  lzop  512 MB/s  29%    539 MB/s
  lz4   555 MB/s  29%   1190 MB/s

If working from files on disk that happen not to be cached, the speed differences are likely to disappear, even on many NVMe disks.

(It just so happens that the concatenation of all text-looking .tar files I happen to have on this machine is roughly a gigabyte (though I did the math for the actual size)).

masklinn

Ain't no way zstd compresses at 5+, even at -1. That's the sort of throughputs you see on lz4 running on a bunch of cores (either half a dozen very fast, or 12~16 merely fast).

worldsavior

Where are you getting this performance? On the average computer this is by far not the speed.

berbec

Valve tends to take a different view...

stackskipton

Valve has different needs than most. Their files rarely change, so they only need to do the expensive compression once, and they save a ton in bandwidth/storage, along with the fact that their users are more tolerant of download responsiveness.

cogman10

Is the network only doing an rsync? Then you are probably right.

For every other network, you should compress as you are likely dealing with multiple tenants that would all like a piece of your 40Gbps bandwidth.

worldsavior

In your logic, you should not compress as multiple tenants would all like a piece of your CPU.

rollcat

Depends. Run a benchmark on your own hardware/network. ZFS uses in-flight compression because CPUs are generally faster than disks. That may or may not be the case for your setup.

creatonez

What? Compression is absolutely essential throughout computing as a whole, especially as CPUs have gotten faster. If you have compressible data sent over the network (or even on disk / in RAM) there's a good chance you should be compressing it. Faster links have not undercut this reality in any significant way.

bityard

Whether or not to compress data before transfer is VERY situationally dependent. I have seen it go both ways and the real-world results do not always match intuition. At the end of the day, if you care about performance, you still have to do proper testing.

(This is the same spiel I give whenever someone says swap on Linux is or is not always beneficial.)
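One rough way to do that testing is to time the compressor on a sample of your own data and compare against your link speed. The sketch below is a deliberately simple model under stated assumptions: compression and transfer happen one after the other (no pipelining), decompression time is ignored, and `transfer_estimate` and its parameters are names invented for the example.

```python
import time
import zlib


def transfer_estimate(data: bytes, link_bytes_per_s: float, level: int = 6) -> dict:
    """Estimate seconds to send data raw versus zlib-compressed first."""
    start = time.perf_counter()
    packed = zlib.compress(data, level)
    compress_s = time.perf_counter() - start
    return {
        "raw_s": len(data) / link_bytes_per_s,
        # Serial model: compress everything, then send the smaller payload.
        "compressed_s": compress_s + len(packed) / link_bytes_per_s,
        "ratio": len(packed) / len(data),
    }
```

Running it against a representative dump at your real link speed shows which side of the trade-off you are on; on a shared or slow link the compressed figure tends to win, on a fast dedicated one it may not.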

Jyaif

He absolutely should be doing this, because by using rsync on a compressed file he's bypassing the whole point of using rsync, which is the rolling-checksum-based algorithm that allows transferring only diffs.

berbec

or used --remove-source-files so they didn't have to ssh back to rm

M95D

Saving to text file is inefficient. I save sqlite databases using VACUUM INTO, like this:

  sqlite3 -readonly /path/db.sqlite "VACUUM INTO '/path/backup.sqlite';"

From https://sqlite.org/lang_vacuum.html :

  The VACUUM command with an INTO clause is an alternative to the backup API for generating backup copies of a live database. The advantage of using VACUUM INTO is that the resulting backup database is minimal in size and hence the amount of filesystem I/O may be reduced.
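The same command can be issued from code. Here is a minimal sketch with Python's built-in `sqlite3` module, assuming a SQLite build of 3.27 or newer (where `VACUUM INTO` was introduced); the function name `vacuum_into` is made up for the example.

```python
import pathlib
import sqlite3


def vacuum_into(src_path: str, dest_path: str) -> None:
    """Write a compacted copy of src_path to dest_path (which must not exist)."""
    # Open the source read-only through a file: URI, mirroring -readonly above.
    uri = pathlib.Path(src_path).resolve().as_uri() + "?mode=ro"
    # Autocommit mode, because VACUUM cannot run inside a transaction.
    conn = sqlite3.connect(uri, uri=True, isolation_level=None)
    try:
        conn.execute("VACUUM INTO ?", (dest_path,))
    finally:
        conn.close()
```

Note that the copy still contains every index, so it is smaller than a fragmented original but not index-free.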

nine_k

It's cool but it does not address the issue of indexes, mentioned in the original post. Not carrying index data over the slow link was the key idea. The VACUUM INTO approach keeps indexes.

A text file may be inefficient as is, but it's perfectly compressible, even with primitive tools like gzip. I'm not sure the SQLite binary format compresses equally well, though it might.

conradev

SQLite tosses out the SQL once it is parsed into bytecode. Using text is just going to take longer, even though I’m sure it works great.

You can modify the database before vacuuming by making a new in-memory database, copying selected tables into it, and then vacuuming that to disk.
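A sketch of that in-memory approach, using Python's built-in `sqlite3` module, might look like the following. It is an illustration under assumptions: `export_tables` is a made-up name, only plain tables are handled (no views, triggers, or generated columns), and indexes are deliberately left behind so they can be recreated on the receiving side.

```python
import sqlite3


def export_tables(src_path: str, dest_path: str, tables: list[str]) -> None:
    """Copy selected tables (without indexes) into a new compact file."""
    mem = sqlite3.connect(":memory:", isolation_level=None)  # autocommit
    try:
        mem.execute("ATTACH DATABASE ? AS src", (src_path,))
        for name in tables:
            row = mem.execute(
                "SELECT sql FROM src.sqlite_master "
                "WHERE type = 'table' AND name = ?",
                (name,),
            ).fetchone()
            if row is None:
                raise ValueError(f"no such table: {name}")
            mem.execute(row[0])  # recreate the table in the in-memory schema
            quoted = '"' + name.replace('"', '""') + '"'
            mem.execute(f"INSERT INTO main.{quoted} SELECT * FROM src.{quoted}")
        mem.execute("DETACH DATABASE src")
        mem.execute("VACUUM INTO ?", (dest_path,))  # write the result to disk
    finally:
        mem.close()
```

The whole selection has to fit in memory, so for very large tables attaching a temporary on-disk database instead may be the more practical variant.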

nine_k

This should be the accepted answer.

vlovich123

> A text file may be inefficient as is, but it's perfectly compressible, even with primitive tools like gzip. I'm not sure the SQLite binary format compresses equally well, though it might.

I hope you’re saying that because of indexes? I think you may want to revisit how compression works to fix your intuition. Text+compression will always be larger and slower than the equivalent binary+compression, assuming text and binary represent the same contents. Why? Binary is less compressible as a percentage but starts off smaller in absolute terms, which will result in a smaller absolute compressed size. A way to think about it is information theory: binary should generally represent the data more compactly already because the structure lives in the code. Compression is about replacing common structure with noise, and it works better if there’s a lot of redundant structure. However, while text has a lot of redundant structure, that’s actually bad for the compressor because it has to find that structure and process more data to do that. Additionally, it is using generic mathematical techniques to remove that structure, which are generically optimal but not as optimal as removing that structure by hand via a binary format.

There’s some nuance here because the text represents slightly different things than the raw binary SQLite (how to restore data in the db vs the precise relationships + data structures for allowing insertion/retrieval). But still I’d expect it to end up smaller compressed for non-trivial databases.

dunham

Below I'm discussing compressed size rather than how "fast" it is to copy databases.

Yeah there are indexes. And even without indexes there is an entire b-tree sitting above the data. So we're weighing the benefits of having a domain dependent compression (binary format) vs dropping all of the derived data. I'm not sure how that will go, but let's try one.

Here is a SQLite file containing metadata for Apple's Photos application:

    767979520 May  1 07:28 Photos.sqlite

Doing a VACUUM INTO:

    719785984 May  1 08:56 photos.sqlite

gzip -k photos.sqlite (this took 20 seconds):

    303360460 May  1 08:56 photos.sqlite.gz

sqlite3 -readonly photos.sqlite .dump > photos.dump (10 seconds):

    1277903237 May  1 09:01 photos.dump

gzip -k photos.dump (21 seconds):

    285086642 May  1 09:01 photos.dump.gz

About 6% smaller for dump vs the original binary (but there are a bunch of indexes in this one). For me, I don't think it'd be worth the small space savings to spend the extra time doing the dump.

With indexes dropped and vacuumed, the compressed binary is 8% smaller than compressed text (despite btree overhead):

    566177792 May  1 09:09 photos_noindex.sqlite
    262067325 May  1 09:09 photos_noindex.sqlite.gz

About 13.5% smaller than compressed binary with indices. And one could re-add the indices on the other side.
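The percentages above can be re-derived from the byte counts in the listings. A small sketch follows; the dictionary keys and the helper name `percent_smaller` are labels chosen for the example, while the numbers are copied from the `ls` output above.

```python
sizes = {
    "binary_gz": 303_360_460,          # photos.sqlite.gz
    "dump_gz": 285_086_642,            # photos.dump.gz
    "noindex_binary_gz": 262_067_325,  # photos_noindex.sqlite.gz
}


def percent_smaller(new: int, old: int) -> float:
    """How much smaller new is than old, as a percentage of old."""
    return 100 * (1 - new / old)


comparisons = [
    ("dump vs binary", "dump_gz", "binary_gz"),
    ("no-index binary vs dump", "noindex_binary_gz", "dump_gz"),
    ("no-index binary vs binary", "noindex_binary_gz", "binary_gz"),
]
for label, new, old in comparisons:
    print(f"{label}: {percent_smaller(sizes[new], sizes[old]):.1f}% smaller")
```

The three results line up with the roughly 6%, 8%, and 13.5% figures quoted in the comment.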

gwbas1c

Does that preserve the indexes? As the TFA mentioned, the indexes are why the sqlite files are huge.

M95D

You're right. It does. I never thought about it until you asked.

4silvertooth

I think it won't preserve the index but it will recreate the index while running the text sql.

simlevesque

In DuckDB you can do the same but export to Parquet, this way the data is an order of magnitude smaller than using text-based SQL statements. It's faster to transfer and faster to load.

https://duckdb.org/docs/stable/sql/statements/export.html

uwemaurer

you can do it with a command line like this:

   duckdb -c "attach  'sqlite-database.db' as db;  copy db.table_name to 'table_name.parquet' (format parquet, compression zstd)"

in my test database this is about 20% smaller than the gzipped text SQL statements.

simlevesque

That's not it. This only exports the table's data, not the database. You lose the index, comments, schemas, partitioning, etc... The whole point of OP's article is how to export the indices in an efficient way.

You'd want to do this:

     duckdb -c "ATTACH 'sqlite-database.db' (READ-ONLY); EXPORT DATABASE 'target_directory' (FORMAT parquet, COMPRESSION zstd)"

Also I wonder how big your test database is and its schema. For large tables Parquet is way more efficient than a 20% reduction.

If there are UUIDs, they're 36 bytes each in text mode and 16 bytes as binary in Parquet. And then if they repeat you can use a dictionary in your Parquet to save the 16 bytes only once.

It's also worth trying to use brotli instead of zstd if small files is your goal.
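The UUID sizes mentioned above are easy to confirm: the canonical hyphenated text form is 36 characters, while the raw value is 128 bits. A quick check with Python's standard `uuid` module (the variable names are just for the example):

```python
import uuid

u = uuid.uuid4()
text_size = len(str(u).encode("ascii"))  # hyphenated hex form, as in a SQL dump
binary_size = len(u.bytes)               # raw 128-bit value

print(text_size, binary_size)
```

So before any compression or dictionary encoding, the binary form is already well under half the size of the text form.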

RenThraysk

SQLite has a session extension, which will track changes to a set of tables and produce a changeset/patchset which can patch a previous version of an SQLite database.

https://www.sqlite.org/sessionintro.html

oefrha

I have yet to see a single SQLite binding supporting this, so it’s quite useless unless you’re writing your application in C, or are open to patching the language binding.

In one of my projects I have implemented my own poor man’s session by writing all the statements and parameters into a separate database, then sync that and replay. Works well enough for a ~30GB database that changes by ~0.1% every day.
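A statement log like the one described could be sketched as follows. This is a guess at the general shape, not the commenter's code: the `LoggedDB` class, the `stmt_log` table, and the JSON encoding of parameters are all assumptions, and JSON limits parameters to values it can represent (no BLOBs).

```python
import json
import sqlite3


class LoggedDB:
    """Run write statements on db and record them in a separate log database."""

    def __init__(self, db: sqlite3.Connection, log: sqlite3.Connection):
        self.db, self.log = db, log
        log.execute(
            "CREATE TABLE IF NOT EXISTS stmt_log ("
            "id INTEGER PRIMARY KEY, sql TEXT NOT NULL, params TEXT NOT NULL)"
        )

    def execute(self, sql: str, params=()):
        self.db.execute(sql, params)
        self.log.execute(
            "INSERT INTO stmt_log (sql, params) VALUES (?, ?)",
            (sql, json.dumps(list(params))),
        )

    def commit(self):
        self.db.commit()
        self.log.commit()


def replay(log: sqlite3.Connection, target: sqlite3.Connection, after_id: int = 0) -> int:
    """Apply logged statements with id > after_id; return the last id applied."""
    last = after_id
    rows = log.execute(
        "SELECT id, sql, params FROM stmt_log WHERE id > ? ORDER BY id", (after_id,)
    ).fetchall()
    for row_id, sql, params in rows:
        target.execute(sql, json.loads(params))
        last = row_id
    target.commit()
    return last
```

Only the small log database needs to be synced; the replica remembers the last id it applied and replays from there. Replay is only faithful if the statements are deterministic, so calls like `random()` or `datetime('now')` would need to be resolved into parameters before logging.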

pdimitar

Well, my upcoming Elixir wrapper of a Rust wrapper of SQLite (heh, I am aware how it sounds) will support it. I am pretty sure people do find it useful and would use it. If not, the 1-2 days of hobby coding to deliver it are not something I would weep over.

paulclinger

I have updated the Lua binding to support the session extension (http://lua.sqlite.org/home/timeline?r=session) and it's been integrated into the current version of cosmopolitan/redbean. This was partially done to support application-level sync of SQLite DBs, however this is still a work in progress.

RenThraysk

There are at least two SQLite bindings for Go.

https://github.com/crawshaw/sqlite

https://github.com/eatonphil/gosqlite/

Ended up with the latter, but did have to add one function binding in C, to inspect changesets.

ncruces

I'm open to adding it to my driver, if people consider it essential.

Every extra bit makes AOT compiling the Wasm slower (impacting startup time).

I also wanna keep the number of variants reasonable, or my repo blows up.

Add your votes for additional features to this issue: https://github.com/ncruces/go-sqlite3/issues/126

simonw

Have you used that? I've read the documentation but I don't think I've ever heard from anyone who uses the extension.

RenThraysk

I have, at least to confirm it does what it says on the tin.

Idea for an offline-first app, where each app install can pull a changeset and apply it to their local db.

nickpeterson

I really wish SQLite had some default way of doing change data capture via session or something similar.

rarrrrrr

If you're regularly syncing from an older version to a new version, you can likely optimize further using gzip with "--rsyncable" option. It will reduce the compression by ~1% but make it so differences from one version to the next are localized instead of cascading through the full length of the compression output.

Another alternative is to skip compression of the dump output, let rsync calculate the differences from a previous uncompressed dump to the current dump, then have rsync compress the change sets it sends over the network. (rsync -z)

rabysh

I think this could be a single pipeline?

ssh username@server "sqlite3 my_remote_database.db .dump | gzip -c" | gunzip -c | sqlite3 my_local_database.db
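The same dump, compress, decompress, restore round trip can be tried locally without ssh. Here is a minimal sketch with Python's built-in `sqlite3` and `gzip` modules; the two function names are made up for the example, and the whole dump is held in memory, so it only suits small databases.

```python
import gzip
import sqlite3


def dump_compressed(conn: sqlite3.Connection) -> bytes:
    """Equivalent of `.dump | gzip -c`: SQL text for the database, gzipped."""
    sql = "\n".join(conn.iterdump())
    return gzip.compress(sql.encode("utf-8"))


def restore_compressed(blob: bytes, dest: sqlite3.Connection) -> None:
    """Equivalent of `gunzip -c | sqlite3 db`: replay the SQL into dest."""
    dest.executescript(gzip.decompress(blob).decode("utf-8"))
```

The dump carries `CREATE INDEX` statements rather than index contents, so indexes are rebuilt on the receiving side instead of being transferred.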

rabysh

gzip/gunzip might also be redundant if using ssh compression with -oCompression=on or -C on the ssh call

sneak

My first thought, too. It also seems somewhat glaringly obvious that it needs a `pv` in there, as well.

xnx

Great example of Cunningham's Law: https://en.wikipedia.org/wiki/Ward_Cunningham#:~:text=%22Cun...

nottorp

Wait... why would you even think about rsyncing a database that can get changed while being copied?

Isn't this a case for proper database servers with replication?

Or if it's an infrequent process done for dev purposes just shut down the application doing writes on the other side?
