Aaron Toponce • 10 years ago

I know this article is old, and prices have changed. However, I want to just nitpick on the finances in your article. You say:

"Let's assume that the average cost of 1 TB of disk (including controller, enterprise class drives, etc.) is at $1000 for the sake of simplicity: Then dedup would save us $5000 in this particular example."

Assuming "enterprise class drives" means 15k SAS drives, then that works. But ZFS is designed with commodity storage in mind. Further, it's likely that "enterprise" datacenters with "enterprise storage" aren't designing storage with ZFS in mind. Instead, they're using EMC or NetApp SANs (I've worked in the "enterprise" industry, and know what these Big Budgets purchase).

I would be willing to bet that most shops that actually deploy large ZFS storage pools are either using Sun/Oracle hardware, with a sales representative, or they're an SMB with Supermicro commodity x86 hardware deploying SATA disk. There are many more SMBs in the United States than "enterprise" corporations, by almost a factor of 2:1 [1]. So, the likely scenario is SATA disk for ZFS storage, not 15k SAS drives.

1- http://www.sba.gov/content/...

In which case, 1 TB of disk would more likely cost you $200 (in 2011 when this article was written), saving you $1000 for your 10 TB array example, or 1/5 of your initial estimate.

So, now we get to your SSD and RAM costs, which I agree with mostly. You say:

"For our fictitious example of a 10 TB storage pool for VDI with an expected dedup savings of 5 TB which translates into $5000 in disk space saved, we'd need to invest in $400 worth of SSD or better $4000 of RAM. That still leaves us with at least $1000 in net savings which means that in this case dedup is a (close) winner!"

If we invest in $400 worth of SSD, then we net $600. But you fail to mention the performance penalty you will incur, even with the DDT on fast SSD. It's not native RAM. A $400 fast SSD will probably give you 20,000 IOPS. DDR3 RAM will net you 100,000 IOPS. So you take a 5x performance penalty by using a $400 SSD to store the DDT on. And if we spend the $4000 in RAM to keep the performance high, then we've lost $3000.

Regardless of the scenario, it's a net loss when using commodity hardware, whether you get hit on performance, or you get hit in your wallet. And this is with a generous 2x dedup savings. It's probably closer to 1.5x or 1.2x, which means the costs get worse.
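
To spell the arithmetic out (using the $200/TB commodity figure above; the prices are of course rough placeholders):

   10 TB pool at a 2x dedup ratio  ->  5 TB of disk saved
   5 TB x $200/TB                   =  $1000 saved on disk
   minus $400 for a DDT SSD         =  $600 net, but with ~5x slower DDT lookups
   minus $4000 for enough RAM       = -$3000 net, at full speed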

You know your readership better than I do, but I would be willing to bet that most reading this article are using commodity hardware over "enterprise" hardware. So the article isn't a good representation of what most users will actually expect with their setup.

Rather, I find compression to be a much larger win. You can get considerable savings in some fairly typical scenarios, and the cost is negligible. Compression doesn't tax the CPU, doesn't require a lookup table, and actually improves disk read and write performance, because you don't need to physically traverse as much platter as you would with non-compressed data. Dedup has its place, no doubt, but it serves a niche demographic, where compression fits a much larger one.
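
For what it's worth, turning compression on and checking what it actually buys you is a one-liner each; the pool/dataset names below are just placeholders:

   # enable compression on a dataset
   zfs set compression=on tank/data
   # after some data has been written, check the achieved ratio
   zfs get compressratio tank/data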

Just my $0.02.

Constantin • 10 years ago

Hi Aaron,

thanks for your thoughtful comment.

I agree: The economics of storage are changing all the time and it’s always useful to assess whether feature X is really worth the effort. And yes, as CPU power becomes cheaper and more abundant vs. the scarcity of IOPS, compression is almost always a win.

The need to dedup is really a function of the cost of storing data vs. the amount of new data that is coming in that you don’t want to ignore/sacrifice/delete. The changes in technology and cost are just one side of the equation – the availability and the value of acquiring new data are the other side.

For businesses with a stagnant or not dramatically increasing volume of data, dedupe is becoming less interesting as the cost of storage goes down.

But there are also businesses with a very large amount of data to store. And then it makes sense to consider dedupe, beyond just mere compression.

Cheers,
Constantin

Ludovic Urbain • 11 years ago

Nice article - I think it's worth mentioning that only RAM will help your dedupe speed if you're already running an all-flash array, and also that dedupe factors (and collisions) are directly related to pool size (I don't remember the probability details but it's better than linear by nature).

Constantin • 11 years ago

Hi Ludovic,

thank you for your comment!

Yes, only RAM helps if your pool is Flash-only, but then you're very lucky to not depend on rotating rust at all :).

The dedupe factor depends on the amount of redundancy in the pool (the number of equal blocks vs. the number of unique blocks), which for statistical reasons improves as the pool size grows. But it really depends on the nature of the data and how likely the use case is to generate duplicate blocks, so there's no real rule of thumb other than to measure the dedupe ratio with as representative a subset of data as possible.
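
If you already have a representative chunk of data sitting in a pool, one way to measure this without actually enabling dedup is zdb's dedup simulation (the pool name is a placeholder, and the run can take a while on a large pool):

   # simulate dedup over the existing pool and print the estimated
   # DDT histogram plus the resulting dedup ratio
   zdb -S tank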

Cheers,
Constantin

Ludovic Urbain • 11 years ago

I don't think you can measure a dedupe ratio with a subset - that's the very nature of dedupe ;)

Also, one needs to be careful about "use cases", as many seem to take it as a general rule that databases don't get good dedupe/compression, which is blatantly false (all those 15,000 users with the "1234" PIN code are a good example ;).

George • 11 years ago

Ahem, I should hope that dedup would not be useful in this case, because you're using individual user password salts, aren't you?

Ludovic Urbain • 11 years ago

And what exactly makes you think "individual user password salts" are *such* a good idea?

If you're referring to the captain hindsights discussing the database leaks, you should by now be aware that the problem lies not with the salts but with the lack of security around backups.

Furthermore, do you *really* think 512-byte dedupe (standard) is going to dedupe BLOCKS that contain two passwords (often 256-byte hashes), no matter the hashing method?

Lastly, compression, which imho is _also_ deduplication, works on passwords too, with or without hashes and at any level of cryptography.

It's a property of any data that there *will* be more duplicates as size grows.

So yeah, individual salts are nice. But if you're going to store them in the database, that's not very useful either ;)

And in the end, who are we kidding?

The most secure password hashing method for a service that contains nothing important and for which the password is sunshine1996? Riiight.

And security only ever matters until they have access to your servers.

Once they're there, you're fucked and no amount of salting will save you.

They've got your binaries, your keys, files and database so the game is over.

Gobs • 11 years ago

That's not a responsible way to handle your users' security. You should protect your data and backups, sure, but you should also make sure that an attacker can't get a usable password list out of a database dump. This is to protect the users who reused their passwords elsewhere.

tl;dr: defense in depth

Ludovic Urbain • 11 years ago

That's utter bullshit.
Defense in depth?

So on one side, you're going to generate semi-random individual salts to protect from possible db leaks, which you will store in the same database, thereby nullifying most of your *tactical* advantage.

And on the other side, your system has so many security flaws database backups get leaked ?

I very much think that individual salts are completely overkill for people who use SSL (lol) as their only security, and have security holes big enough to leak db backups.

George • 11 years ago

This is an incredibly irresponsible and naive view of user credentials management. You seem to be arguing that if we consider the protection worthwhile in the first place, we are admitting we're insecure enough to get dumped.

Yes, we'll also do everything in our power to prevent that. But in the cases where it happens (and you must KNOW it's not possible to GUARANTEE you won't get your db leaked), at least you can tell your users: don't worry, your passwords are secure and there is (virtually) no risk they will decipher them and pair them up with your e-mail address to, say, your PayPal account.

Sorry Constantin but this guy might be managing a db with one of MY passwords on!

Constantin • 11 years ago

Guys, stay cool. This is a ZFS article, not a security one. Go discuss this on a security forum or better yet, in private.

Thanks :).

Constantin

Redstar • 12 years ago

Hello,
this article helped me a lot in understanding the RAM requirements for high-speed ZFS use, especially the part about block size measurement.

Still, I think I agree with Olli when it comes to the 1/4-of-ARC limitation: If I imagine a system without dedup, then all metadata has to be stored in the given RAM. So when I add RAM in order to accommodate the dedup table, I should be able to increase the amount of metadata in the ARC without a severe penalty: there is still the same amount for other ARC data, and the new dedup table sits in the 'new RAM'.

So I think following your advice and adding 4x the expected dedup table size in RAM means you invest more than you need to, which in turn means the break-even point for dedup might already be reached at lower deduplication ratios.

So I would be pleased if you could explain once more why I want additional RAM for the ARC in case I want to use dedup. Nevertheless, the article (including your replies to the comments) is a great source of information about dedup.

Constantin • 12 years ago

Hi Redstar,

thanks for your comment!

Yes, the ARC uses some of its RAM to store the dedup table and some to store metadata. For optimum performance, you want both to fit into RAM. If RAM is expensive (which it usually is not, compared to the cost of not getting to critical data quickly enough), then you can trade off RAM for performance by using less RAM and letting the ARC prioritize which pieces of the dedup table are most likely needed and which metadata it will need next.

So the RAM sizing I recommend assumes you want maximum performance, in which case you get enough RAM to store both the dedup table and the metadata. But you can trade off as much as you want, including the use of L2ARC on SSDs.

Once more: If you use dedup, then ZFS needs more data structures to keep track of dedup blocks (the dedup table) and hence it needs more RAM to cache it.

How much exactly is up to you and your performance requirements. You can also set up a system and measure the exact requirements by running arc_summary.pl. Perhaps you can get away with a bit less RAM.
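
For example, on a Solaris-based system something along these lines (the exact script location and kstat names may vary by release):

   # overall ARC breakdown, including metadata usage
   perl arc_summary.pl
   # or query the raw ARC kstats directly
   kstat -p zfs:0:arcstats:arc_meta_used
   kstat -p zfs:0:arcstats:arc_meta_limit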

Cheers,
   Constantin

77109 • 12 years ago

Hi,
I love your explanation. So for each block we have 230 bytes of dedup table, but do you know the size of the rest of the metadata? I'd like to know the real amount of metadata per block on ZFS so I can calculate the total amount of RAM needed.
Thanks

Constantin • 12 years ago

Hi 77109,

thanks for your comment!

The other metadata that ZFS needs to go through to access a specific block is the chain of metadata blocks leading from the uberblock down to it. Every metadata block contains up to 256 pointers to the next level of metadata blocks. If we take the lowest level of metadata blocks right above the actual data blocks, we're talking about 1/256th or less of the total blocks devoted to metadata at that level. Each metadata block is stored 3 times for redundancy, which makes around 1.2% of all blocks. We can ignore all upper levels of metadata blocks for simplicity, since they only change the result by around 1% of that percentage.

Assuming that your ZFS file system is not full and assuming that the metadata block sizes are similar to data block sizes, we can assume that total metadata for a ZFS pool is in the order of 1% of pool size.

So, for every TB of ZFS pool data, the metadata on disk is in the order of 10 GB. Since we calculated a 3x factor for redundancy, we can divide this back for actual RAM usage (metadata blocks in RAM aren't stored redundantly), so memory usage is actually around 3GB per TB of pool data. The more RAM (or L2ARC) you have in your system, the more of that metadata is going to be readily available there, and the easier it becomes to access the pool's data since extra seeks for metadata blocks are avoided.
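
Spelled out as a back-of-the-envelope calculation for a 1 TB pool (same approximations as above):

   metadata on disk  ~ 1% of 1 TB   ~ 10 GB   (includes the 3x redundancy)
   metadata in RAM   ~ 10 GB / 3    ~ 3.3 GB  per TB of pool data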

Hope this helps,
   Constantin

Olli • 12 years ago

But what if you change your ARC to use more than 1/4 for metadata caching? Just set zfs:zfs_arc_meta_limit in /etc/system and you can use much, much more RAM for metadata in the ARC (in fact we use almost the whole available RAM for this).
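
For reference, the /etc/system entry looks roughly like this; the value is in bytes, purely an example, and only takes effect after a reboot:

   * raise the ARC metadata limit to 8 GB (example value only)
   set zfs:zfs_arc_meta_limit = 0x200000000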

Constantin • 12 years ago

Yes, you can tune your way out of spending that much RAM and grow the metadata piece of the ARC - at the expense of caching data blocks. If all of your data reads are 100% random, you can do that, but in most cases there's going to be a good number of data blocks you want to cache in that other 3/4 of your ARC.

Howadah • 12 years ago

I didn't realize that dedupe gobbled so much RAM. Hard drives are getting cheaper every day too...

Foresee any problems with turning dedupe off after it's been on for a while?

Constantin • 12 years ago

Hi Hayes,

thanks for your comment!

Well, it doesn't necessarily eat RAM: It just grows a significantly large data structure that will compete for ARC space in your limited RAM, depending on how frequently it is used. That said, if the dedup table isn't in RAM because RAM is limited and an app needs to write something, that write can be quite slow. And in situations of higher sustained write load (as in "virtualization" or "file server"), the dedup table can become a bulky cache citizen.

If you turn dedupe off, then the DDT won't be used for writing new data any more, so the ARC will be freed to store other data and metadata.

Cheers,
Constantin

fblittle • 12 years ago

What is the cost of dedupe in terms of write speed if you use spinning disk? For large file writes, as in video, by what percentage will it slow the writes? Just a rule of thumb.

Also with reference to the above question:
Foresee any problems with turning dedupe off after it's been on for a while?

If I have turned dedupe on in a pool and it has been running a few months with a ratio of 1.2 or so, and I then turn dedupe off, how will it repopulate my pool with the duplicate data? Will my data be highly fragmented when done?

Constantin • 12 years ago

It's hard to tell, because the cost of dedup is highly individual. Here's what happens over time:
- As the dedup table grows, it starts to compete with other metadata for ARC space.
- As a result, read speeds can suffer as ZFS needs to handle extra metadata reads to access blocks on disks.
- Similarly, writes will trigger extra reads for dedup entries and metadata if RAM becomes scarce.

It's not the end of the world, but it is noticeable and I've heard of numerous cases where performance was bad after some time and the reason was simply low RAM and dedup. After turning it off, those cases immediately experienced better performance.

When dedupe is turned off, the dedup table simply isn't used anymore and more RAM is freed to serve other metadata. Old, deduped data stays deduped; new duplicate data will not be deduped, and over time the dedup ratio will drop.
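
You can watch that happen via the pool's dedupratio property (pool name is a placeholder):

   # current overall dedup ratio of the pool
   zpool get dedupratio tank
   # or alongside the capacity figures
   zpool list -o name,size,dedupratio tank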

I'm not aware of any fragmentation issues that would be dedupe related.

Hope this helps,
Constantin

BoruG • 10 years ago

I wonder, if this is the case, why these engineers don't come up with some way to do the dedupe in the background. ZFS tries to do this in real time and that is not practical. They need to perform dedup in the background. I hope someone can come up with a way to do it.

Bp_sti • 10 years ago

I'm a big fan of ZFS, but I do have to ask: why is dedup so much more "expensive" on ZFS when it's virtually "free" in Windows 2012? It seems (from the limited amount I've read on the subject) that for just about any scenario where it would be useful you should turn it on in Windows 2012, while in ZFS you have to be very, very careful when and where you use it.

Constantin • 10 years ago

Hi Ben,

I don’t know how Microsoft handles deduplication, so I can’t comment on that. The design of deduplication in ZFS is pretty simple and straightforward because it leverages the checksums that are already there, with little overhead. And it is instant: no after-the-fact scanning for duplicate blocks. The price is not that high: more RAM and/or a cache device will speed it up and also benefit many other ZFS transactions, but it's not strictly required, though strongly advised. This article is more about how to do your cost/benefit analysis and then decide whether dedupe is a good idea for your particular use-case.

Deduplication is never free: Some effort needs to go into identifying duplicates, managing the deduplicated blocks etc.

Cheers,
Constantin

Bp_sti • 10 years ago

OK, that leads me into another question: can dedup be enabled on a per-"filesystem" basis? In other words, could I make a separate "filesystem" in my pool (since all the "filesystems" share the same total space available on the drive) and enable dedupe *only* on that filesystem? If that works, would it "pay off" to use it on a filesystem just for backups from Windows machines? In other words, if I ran a full "drive image" (ISO) style backup 3 times a week, would I end up saving space, or would those ISO files need to be identical before ZFS could dedup them?

Windows 2012 "seems" to offer extremely high dedup rates with almost no memory "costs". It's doing something vastly different than ZFS, apparently. I've seen people enable dedup on 4+ TB of data with the machine itself having no more than 8-16 GB of RAM, while here with ZFS we are talking 20 GB *per TB* of storage.

Honestly I'll probably just use a dedup enabled backup program like eXdupe or something. :)

Constantin • 10 years ago

Yes, dedupe on ZFS can be enabled on a per filesystem basis.
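
A minimal sketch, with made-up dataset names:

   # create a dataset just for the Windows image backups and turn
   # dedup on only there; the rest of the pool stays untouched
   zfs create tank/winbackups
   zfs set dedup=on tank/winbackups
   zfs get dedup tank/winbackups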

You can have dedupe on ZFS with 8-16 GB of RAM on a 4+ TB pool without any problems; we're just optimizing performance here.

I suggest you just try it out with your use-case and see for yourself.

Dedupe on ZFS is based on blocks, so multiple ISOs of the same or incremental data should dedupe just fine.

Cheers,
Constantin

ChickenBall • 11 years ago

Great article! While this isn't exactly ZFS related, one really good selling point for a business case is reduced network utilization and job length for off-site dedupe backup. :)

Constantin • 11 years ago

Hi Yawhann,

thanks for your comment! Yes, deduplication saves IO everywhere. This is why compressed filesystems are also a plus in most cases: the IOs they save are more significant than the CPU overhead for compression.

Cheers,
Constantin

Boris Protopopov • 11 years ago

Hello, Constantin,
interesting discussion; the post refers to the ZFS FAQs for the way to assess the max amount of memory taken by the DDT entries. I think there is a bit of overestimation there:

looking at the code in illumos, for instance, one can see that there are two ways DDT consumes memory:

1) as blocks buffered in ARC
2) as ddt_entry_t in the avl_trees in ddt_t structs

the latter is transient as it is destroyed at every txg processing boundary (spa_sync() call), and the former is limited to 1/4 of ARC size by default, as mentioned in this post. If we were to let the ARC grow unlimited, the size taken in ARC would be:

'on-disk DDT entry size' times 'total number of entries'

The on-disk size is 64 bytes (echo "::sizeof ddt_phys_t" | sudo mdb -k), which is significantly less than 320 bytes - the estimate used for calculations in this post and the ZFS FAQs. The ddt_phys_t can be compressed, so, in fact, the size of DDT entries on disk is 64 bytes or less.

I believe a better way to estimate DDT memory requirements is to use the size of the DDT entry on disk, as opposed to the size of in-core DDT entry ddt_entry_t (which appears to be 376 bytes at this time: echo "::sizeof ddt_entry_t" | sudo mdb -k). The reason for this is that DDT is cached as blocks in ARC, not as ddt_entry_t structs, as I understand it.
Best regards, Boris.

Constantin • 11 years ago

Hi Boris,

thanks for sharing, this sounds like a great idea!

Thanks,
Constantin

zhivotnoe • 11 years ago

Great! Thank you for the analysis.

patkoscsaba • 12 years ago

I am curious: wasn't the '1/4' limit of zfs_arc_meta_limit removed with the fix for CR# 6957289, "ARC metadata limit can have serious adverse effect on dedup performance"?
Reading that bug and some forums, it seems the limit was modified to default to arc_c_max ( https://forums.oracle.com/f... ).

Constantin • 12 years ago

Hi Patkos,

yes, if you're running a version of Solaris where the fix for this bug has been integrated, then you don't need to worry about the 1/4 ARC limitation for metadata.

Cheers,
Constantin

James Trent • 12 years ago

Great article. Do you think it is necessary to use industry-level data deduplication software for home use?

Constantin • 12 years ago

ZFS offers industry-level data deduplication at a fraction of the cost, so it is accessible for home use. So the question boils down to: how do I save more, by buying more disks for my data or by expanding RAM so I can dedup efficiently?

BTW, please refrain from SEO-Linking non-relevant stuff in comments, I removed the link from your post. Nice try.

Jeff • 12 years ago

If you turn dedupe off for the performance and later turn it back on, will it try (or can you make it try) to scavenge duped blocks? I guess I am wondering if there would be a reasonable way to keep performance high during peak usage, and still ultimately get the space savings by doing cleanup during idle time.

Constantin • 12 years ago

Hi Jeff,

no, there is no such thing as dedup-after-the-fact in ZFS. The only thing you could do would be to force a re-write of the data so that the blocks get written (and deduped) again. If you expect significant savings from dedupe, then it makes sense to expand your RAM and add an L2ARC SSD for handling the extra dedupe table requirements. That is the most efficient option.
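
Adding the L2ARC device itself is a single command; the device name below is just a placeholder:

   # attach an SSD as a cache (L2ARC) device to the pool
   zpool add tank cache c1t2d0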

Hope this helps,
   Constantin

Steve Gonczi • 8 years ago

Hello everybody,

In addition to the drawbacks already mentioned, dedup imposes a performance penalty on the entire zpool where it is turned on.

There are 3 on-disk tables: the singletons, the duplicates (entries with ref count >1) and the dittos. Even if we achieve little or no actual dedup, we still have this speculative singleton table. This has to be consulted in the write path, to see if we can dedup. This imposes a constant extra random read load when writing data, for little or no benefit.

As we delete deduped files, the table does not go away, because it is a zap table. It shrinks somewhat because of compression and trailing zero elimination, but not significantly. To observe the effect of this, take a look at the dedup stats via zdb or the zpool command's -D option, and you will see that the reported size of the dedup dictionary entries goes up as you delete blocks.
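
Concretely (pool name is a placeholder):

   # detailed DDT statistics, including the reported entry sizes
   zdb -DD tank
   # dedup table summary via the zpool command
   zpool status -D tank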

This is because the size of the entries is calculated by dividing the on-disk table size by the number of entries.
So, a more-or-less unchanged zap table, combined with a reduced number of deduped blocks, yields a larger reported dictionary entry size.

When we are deleting data, updating the reference counts becomes a problem: specifically, a random I/O storm that can prolong transaction group sync times. Deletes are not throttled sufficiently because, in the absence of dedup, they would not impose a significant load.

In my experience, "lucky dedup" (where one block accidentally happens to be identical to a totally unrelated other block) just does not happen all that much in the wild. For this reason, using smaller blocks to achieve better dedup is not useful.

However, what _does_ happen is identical files, or files that begin the same way. Notice I am not saying "end the same way", because the identical bytes in the middle or at the end will likely be misaligned and thus not recognized.

This is a disadvantage of block based, in-line dedup.

But there are some workloads where dedup can be very useful, e.g. template-based vmdk deployment, or storing backup data on a ZFS system.

In these cases, it is often possible to achieve 10x or better dedup.

The dedup ratio is computed based on the singleton and other tables, and completely ignores any blocks that do not have dedup dictionary entries. Thus the blocks of all file systems with "dedup=off" are effectively omitted from the calculation.

For accurate reporting and minimal performance impact on non-deduped data, the best thing to do is to place all dedupe-able filesystems and volumes into a separate zpool where dedup is turned on pool-wide.

There are some possible mitigation strategies that could be taken (all code changes), and I am hopeful that ZFS will evolve in this direction. E.g., reference counts do not necessarily need to be stored with the dedup entries, which have zero locality.

Separating the refcounts could allow them to be stored with a different scheme which could have a higher degree of locality.
The dedup dictionary entries are as big as they are in the first place because they (like most metadata) are stored in triplicate, which is arguably not necessary. (Any desired data redundancy can be relegated to the vdev layer and handled explicitly via mirroring or RAIDZ.)