UUIDs are so funny today because almost all of their deployments are "we take a string of mostly-random bits but give it structure for no apparent reason"

it's true that you could populate some of those bits with predictable sequences to avoid clashing (a la snowflake) but guess what: you can do this to a hex string or a number (as does snowflake). i think the only reason UUIDs are as popular as they are today is Microsoft?

@whitequark Most applications that use UUIDs these days tend to use version 7 UUIDs, which use milliseconds since the Unix epoch as the most significant bits. This embeds the creation timestamp in the ID and allows for sorting, while also adding in sufficient randomness so they're not incremental and multiple systems can generate them with low probability of collision.
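To make the layout described above concrete, here is a minimal sketch of a version 7 style generator, assuming the usual arrangement of a 48-bit millisecond timestamp in the most significant bits followed by random bits. The function name `uuid7_sketch` and its `unix_ms` parameter are hypothetical, and the sketch skips extras such as monotonic counters within a single millisecond.

```python
import os
import time
import uuid


def uuid7_sketch(unix_ms=None):
    """Build a version 7 style UUID: 48-bit ms timestamp, then random bits."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    # 80 random bits fill everything below the timestamp.
    rand = int.from_bytes(os.urandom(10), "big")
    value = ((unix_ms & ((1 << 48) - 1)) << 80) | rand
    # Overwrite the 4 version bits (bits 76-79) with 7.
    value = (value & ~(0xF << 76)) | (0x7 << 76)
    # Overwrite the 2 variant bits (bits 62-63) with binary 10.
    value = (value & ~(0x3 << 62)) | (0x2 << 62)
    return uuid.UUID(int=value)


if __name__ == "__main__":
    print(uuid7_sketch())
```

Because the timestamp occupies the leading bits, IDs generated in different milliseconds sort by creation time, while the random tail keeps IDs from the same millisecond unlikely to collide.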

@whitequark @ramsey I've only ever seen version 4 (unstructured random) in production that i can recall

@azonenberg @whitequark @ramsey

V7 is fairly new, standardised in 2024 (RFC 9562). It's gained a bit more adoption in databases over the last year.

@intrbiz @azonenberg @whitequark @ramsey Yeah, postgres 18 added support for generating them IIRC, though UUIDs are inherently "mostly backwards compatible" unless you're trying to parse them for some godforsaken reason, so older versions support it just fine if the client generates them.

They make for much happier indexes and sharding vs the ones with leading-random, because most workloads don't have truly random access patterns...

@becomethewaifu

Indeed, I gave a talk at POSETTE last year about encoding information into UUIDs and some of the index issues.

IMHO you can have more fun than just encoding generation time into them.

@intrbiz @becomethewaifu Version 8 is probably more suited to those use-cases, though.

@ramsey

Indeed, my code was generating UUIDs marked as V8. The version is just a nibble that's been standardised. And it's handy to have a standardised version number for custom generation schemes.
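As a rough illustration of "the version is just a nibble", the sketch below stamps the version 8 marker and the standard variant bits onto an arbitrary 128-bit payload. The name `uuid8_sketch` and the idea of passing the payload as a single integer are assumptions for the example, not how any particular generator in this thread works.

```python
import uuid


def uuid8_sketch(payload: int) -> uuid.UUID:
    """Mark a custom 128-bit payload as a version 8 UUID."""
    value = payload & ((1 << 128) - 1)
    # The version nibble lives in bits 76-79; set it to 8.
    value = (value & ~(0xF << 76)) | (0x8 << 76)
    # The variant lives in bits 62-63; set it to binary 10.
    value = (value & ~(0x3 << 62)) | (0x2 << 62)
    return uuid.UUID(int=value)


if __name__ == "__main__":
    print(uuid8_sketch(0))  # 00000000-0000-8000-8000-000000000000
```

The six overwritten bits are the only part with fixed meaning; the other 122 bits are free for whatever scheme the application defines.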

@intrbiz @ramsey time-based (type 1 back in the day, now type 7) is really useful for debugging; the creation date and time can be a really strong hint.

@kw217 @intrbiz Version 1 still exists, but it’s based on the weird value of 100-nanosecond intervals since the Gregorian epoch in 1582. Version 7 is based on milliseconds since the Unix epoch.
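For the debugging use mentioned above, the two timestamp formats can be decoded roughly like this. It is a sketch: the helper names `v1_datetime` and `v7_datetime` are made up for the example, and the constant is the number of 100-nanosecond intervals between 1582-10-15 and 1970-01-01.

```python
import datetime
import uuid

UNIX_EPOCH = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)
# 100-ns intervals from the Gregorian epoch (1582-10-15) to the Unix epoch.
GREGORIAN_TO_UNIX_100NS = 0x01B21DD213814000


def v1_datetime(u: uuid.UUID) -> datetime.datetime:
    """Creation time of a version 1 UUID (60-bit count of 100-ns intervals)."""
    unix_100ns = u.time - GREGORIAN_TO_UNIX_100NS
    # datetime only resolves microseconds, so drop the last decimal digit.
    return UNIX_EPOCH + datetime.timedelta(microseconds=unix_100ns // 10)


def v7_datetime(u: uuid.UUID) -> datetime.datetime:
    """Creation time of a version 7 UUID (top 48 bits are Unix milliseconds)."""
    unix_ms = u.int >> 80
    return UNIX_EPOCH + datetime.timedelta(milliseconds=unix_ms)


if __name__ == "__main__":
    print(v1_datetime(uuid.uuid1()))
```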

@kw217 @intrbiz One reason not to use version 1 is that it leaks details about the system (i.e., the MAC address). Another reason is that the values aren’t sortable, because the low bits of the timestamp come first. Version 6 was introduced to solve this. It’s also based on 100-nanosecond intervals since the Gregorian epoch, but it’s sortable and uses random bytes following the timestamp, rather than the MAC address.

But, for most purposes, version 7 is the right solution, unless you need to create UUIDs for dates earlier than 1970.
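The difference between version 1 and version 6 comes down to the order of the timestamp fields. The sketch below rearranges a version 1 UUID into the version 6 layout (most significant timestamp bits first) and swaps the node for random bytes. The function name `v1_to_v6_sketch` is hypothetical, and it assumes the input really is a version 1 UUID.

```python
import os
import uuid


def v1_to_v6_sketch(u: uuid.UUID) -> uuid.UUID:
    """Reorder a version 1 timestamp into the sortable version 6 layout."""
    ts = u.time  # 60-bit count of 100-ns intervals since 1582-10-15
    high = ((ts >> 28) & 0xFFFFFFFF) << 96  # top 32 timestamp bits first
    mid = ((ts >> 12) & 0xFFFF) << 80       # next 16 bits
    low = (ts & 0xFFF) << 64                # last 12 bits, after the version
    # Keep the variant and clock sequence (bits 48-63) from the original.
    tail = u.int & 0xFFFF_0000_0000_0000
    # Replace the 48-bit node (often a MAC address) with random bytes.
    node = int.from_bytes(os.urandom(6), "big")
    return uuid.UUID(int=high | mid | (0x6 << 76) | low | tail | node)


if __name__ == "__main__":
    original = uuid.uuid1()
    print(original, "->", v1_to_v6_sketch(original))
```

With the timestamp stored high-to-low, comparing two version 6 values as plain bytes agrees with comparing their creation times, which is not the case for version 1.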

@ramsey @kw217 @intrbiz as a privacy person we should say that individuals and activist groups probably want to avoid leaking timestamps, especially fine-grained timestamps, pretty much everywhere and always, although the attack models are highly indirect and people don't usually know what they're protecting until they've lost it

it's fine for corporate use as long as it can't be tied to an individual

@ireneista @ramsey @intrbiz entirely fair - this was in a corporate context; in activism circles (hmm, in most places now) one should definitely think carefully about privacy properties.

@ireneista @ramsey @kw217

Valid concern in some domains for sure. I tend to keep the time buckets pretty big. There are block-based approaches which remove the time-related issues but still reduce issues with things like indexes.
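One way to read "big time buckets" is to round the timestamp down to a coarse boundary before embedding it, so the ID only reveals which bucket it was created in. This is a sketch of that idea, not the scheme from the talk: the name `bucketed_uuid_sketch`, the one-hour default, and marking the result as version 8 are all assumptions.

```python
import os
import time
import uuid


def bucketed_uuid_sketch(bucket_seconds=3600, unix_seconds=None):
    """Custom (version 8) UUID whose leading 48 bits are a coarse time bucket."""
    if unix_seconds is None:
        unix_seconds = int(time.time())
    # Round down to the start of the bucket, then express it in milliseconds.
    bucket_start_ms = (unix_seconds - unix_seconds % bucket_seconds) * 1000
    rand = int.from_bytes(os.urandom(10), "big")
    value = ((bucket_start_ms & ((1 << 48) - 1)) << 80) | rand
    value = (value & ~(0xF << 76)) | (0x8 << 76)  # version nibble: 8 (custom)
    value = (value & ~(0x3 << 62)) | (0x2 << 62)  # variant bits: binary 10
    return uuid.UUID(int=value)


if __name__ == "__main__":
    print(bucketed_uuid_sketch())
```

IDs from the same bucket share a prefix, so they still cluster in an index, but their order within the bucket is random and the exact creation time is not recoverable.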

@intrbiz @ramsey @kw217 yeah, fascinatingly, if you allocate IDs at a global scale via sharding, the shards wind up forming a weak proxy for geographic location

this stuff is really hard to get right

@ireneista @ramsey @kw217 @intrbiz huh? what does that reveal, other than a time that your computer was on and (maybe, if your UUID generation code is deeply misconfigured) your timezone?

@AVincentInSpace @ramsey @kw217 @intrbiz well, exactly that. whether that's a problem depends on what you use it for.

@ireneista @AVincentInSpace @kw217 @intrbiz Right. Let’s say that someone is trying to find out what you were doing at a certain time. If they find an ID that was generated at a specific time and owned by your account, then they can deduce a general idea of what you might have been doing at that time (e.g., you were probably using a certain app during that time).

@ramsey @ireneista @AVincentInSpace @intrbiz having very fine grained timing info lets you very precisely correlate messages across systems. Of the thousands of messages that went across the network in this second, despite any crypto, you can say *this* one was the one your subject sent, and *this* is where it went in the network, with high degree of confidence. (Other attacks too, like fingerprinting the sender's clock, but they're a bit more involved.)

@ramsey @intrbiz good ol' Microsoft "hectonanoseconds". Weird unit but presumably shoehorned into a particular bitwidth and range that made sense back in the day (Windows NT?)

@ramsey @intrbiz that article has the Apollo timestamp (v0) as 48 bits, which isn't wide enough for hectonanoseconds. Wikipedia agrees they came in with Windows in the nineties.

@kw217 @intrbiz You’re right. The article doesn’t say when they started using the Gregorian timestamp. I was making an assumption. I guess Microsoft was the first to do that?

@ramsey @intrbiz I'm making an assumption too; I was hoping Raymond Chen would have something definitive but he doesn't. NT uses 100ns units but a different origin (1601 AD).

@kw217 @intrbiz I wonder why 1601. I know that Great Britain didn’t adopt the Gregorian calendar until 1752, so that year would make sense to me.
