⚠️ This document is mostly outdated, the source of truth is ‣ ⚠️

Introduction

IPFS currently lacks many privacy protections. One of its main weak points is the DHT content routing subsystem. In the IPFS DHT today, neither readers (clients accessing files) nor writers (hosts storing and distributing content) have much privacy with regard to the content they publish or consume. A DHT server node or a passive observer can easily learn which file is requested by which client during the routing process, because the requested CID is visible to them. A curious actor could then request the same CID and download the associated file to monitor the user’s behavior. This is clearly undesirable, and fixing it has long been a strong request from the community.

The changes described in this document introduce a DHT privacy upgrade boosting the reader’s (client’s) privacy. It will prevent DHT tracking as described above, and add Provider Records Authentication. The proposed modifications will also add a slight Writer Privacy improvement as a side effect.

Ideally, this change will be included in IPFS Reframe, improving reader’s privacy not only in the DHT lookup process but also in the context of Indexers and Delegated Routing.

Routing in IPFS

One of the primary components of IPFS is its Content Routing module. Files are identified by their Content IDentifier (CID), which is derived from the hash of their content. Pointers to the hosts storing the content identified by the CIDs, called Provider Records, are stored in a global Distributed Hash Table (DHT). The IPFS DHT is an implementation of the Kademlia DHT.

Double Hashing

The Provider Records are stored in the DHT at the location of the SHA256 hash of the content’s multihash (MH), which is part of the CID; hence, “double-hashed”. Provider Records are stored on the $k_{repl}=20$ DHT nodes whose Hash(peerID) is closest to Hash(MH) in terms of XOR distance. The pointer to the content provider is therefore located at the second hash of the file’s content, Hash(Hash(Content)).
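The placement rule above can be illustrated with a minimal Python sketch. It is not the IPFS implementation: the function names (`multihash_sha256`, `provider_record_key`, `closest_peers`) and the byte-string peer IDs are hypothetical, and the multihash is assumed to be a sha2-256 digest with its two-byte prefix (function code 0x12, length 0x20).

```python
import hashlib

K_REPL = 20  # replication factor k_repl from the text


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def multihash_sha256(content: bytes) -> bytes:
    """Assumed MH layout: sha2-256 code (0x12), digest length (0x20), digest."""
    return bytes([0x12, 0x20]) + sha256(content)


def provider_record_key(mh: bytes) -> bytes:
    """DHT location of the Provider Record: Hash(MH), the second hash."""
    return sha256(mh)


def xor_distance(a: bytes, b: bytes) -> int:
    """Kademlia XOR distance between two equal-length keys."""
    return int.from_bytes(a, "big") ^ int.from_bytes(b, "big")


def closest_peers(peer_ids, key: bytes, k: int = K_REPL):
    """The k peers whose Hash(peerID) is closest to key by XOR distance."""
    return sorted(peer_ids, key=lambda pid: xor_distance(sha256(pid), key))[:k]
```

For example, `closest_peers(peers, provider_record_key(multihash_sha256(data)))` returns the 20 peers that would hold the Provider Record for `data` in this toy model.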

Content lookup

In order to retrieve a file, a client must know the file’s CID, which contains MH. The client looks in its own routing table for the peers whose Hash(peerID) is closest to Hash(MH). It then sends a request to these DHT server peers, indicating that it is looking for CID. The DHT server peers extract MH from CID, compute Hash(MH) and return to the client the peers from their own routing table whose Hash(peerID) is closest to Hash(MH). The client then contacts these peers, repeating the process until it finds a DHT server peer that hosts the Provider Record associated with CID. The DHT server nodes storing the Provider Record associated with CID simply send the Provider Record to the client upon request. At this point, the client knows the peerID of the node hosting the content identified by CID. If the client doesn’t know the multiaddress of the content provider, it must perform another DHT lookup to resolve the multiaddress of the peer from its peerID. It then requests CID from the content provider using Bitswap.

In reality, the DHT server nodes return the $k_{repl}$ closest nodes to Hash(MH) and not just one as in the simplified example. They do so to accelerate the routing process, so that the client learns about all the nodes storing a Provider Record. The notation $k_{repl}$ is specific to this document and denotes the replication factor k. It is one of the two magic k values of Kademlia, the other being $k_{bucket}$, the size of Kademlia k-buckets. In IPFS we have $k_{repl} = k_{bucket} = 20$.
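The lookup described above can be sketched as a toy, in-memory simulation. This is only an illustration under simplifying assumptions, not the libp2p protocol: the `Node` class, `handle_request`, `publish`, `lookup` and `build_toy_network` are hypothetical names, k is set to 3 instead of 20, there is no networking, and every server knows every other server. Note that `handle_request` receives MH in the clear, which is exactly what lets a DHT server observe the requested content.

```python
import hashlib

K = 3  # toy value; the text uses k_repl = k_bucket = 20


def h(data: bytes) -> int:
    """SHA256 as an integer, so XOR distance is a plain `^`."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


class Node:
    def __init__(self, peer_id: bytes):
        self.peer_id = peer_id
        self.routing_table = []     # other Node objects this peer knows
        self.provider_records = {}  # Hash(MH) -> provider peerID

    def closest(self, key: int, k: int = K):
        """Known peers whose Hash(peerID) is closest to key."""
        return sorted(self.routing_table, key=lambda n: h(n.peer_id) ^ key)[:k]

    def handle_request(self, mh: bytes):
        """Server side: the request carries MH in the clear."""
        key = h(mh)
        return self.provider_records.get(key), self.closest(key)


def publish(servers, mh: bytes, provider_id: bytes, k: int = K):
    """Store the Provider Record on the k servers closest to Hash(MH)."""
    key = h(mh)
    for node in sorted(servers, key=lambda n: h(n.peer_id) ^ key)[:k]:
        node.provider_records[key] = provider_id


def lookup(client: Node, mh: bytes):
    """Query ever-closer peers until one returns the Provider Record."""
    key = h(mh)
    queried = set()
    candidates = client.closest(key)
    while candidates:
        node = candidates.pop(0)
        if node.peer_id in queried:
            continue
        queried.add(node.peer_id)
        record, closer = node.handle_request(mh)
        if record is not None:
            return record, len(queried)
        # Merge the returned peers and keep the closest ones first.
        candidates = sorted(candidates + closer, key=lambda n: h(n.peer_id) ^ key)
    return None, len(queried)


def build_toy_network(n: int = 30):
    """Servers know each other; the client only knows three of them."""
    servers = [Node(f"server-{i}".encode()) for i in range(n)]
    for s in servers:
        s.routing_table = [o for o in servers if o is not s]
    client = Node(b"client")
    client.routing_table = servers[:3]
    return client, servers
```

A typical run calls `build_toy_network()`, then `publish(servers, mh, b"provider-Y")`, then `lookup(client, mh)`, which returns the provider's peerID together with the number of servers queried. In this toy network, every one of those queried servers has seen MH.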

Benefits of Double Hashing

This section lists the benefits of the proposed change. It is mostly an overview, as the changes are described in detail in the following sections.

Reader Privacy

Currently, adversaries can (1) passively observe the client’s traffic, (2) act as DHT server peers that help resolve a query, or (3) act as DHT server peers serving the requested provider record. In each case, they can monitor the CID that a client is trying to access. If they haven’t seen this CID before, they can in turn request it to learn what content was requested by a specific client. This is a significant concern for reader privacy, as all the requested content can be monitored.

We want to provide better client privacy guarantees while keeping overhead to a minimum. Hence, we propose two changes to improve reader privacy in IPFS. These changes give a client more privacy in all three cases mentioned above: they provide privacy from passive observers - case (1), from DHT server peers that act as intermediate routers for the request - case (2), and from DHT server peers that serve the provider records - case (3). The proposed changes do not, however, protect the client’s privacy from the content provider.

The DHT server peers serving the provider record could guess that client X is accessing content from provider Y, but would not know which content. However, if Y has published only a single content identifier, the DHT server peers serving the provider record can trivially learn which content X is accessing.