
One-Sentence Summary
The authors proposed a new framework that addresses several limitations of PrivKVM and achieves strong experimental results on four real-world datasets.
Background
Large technology companies collect private user data to provide better services, but this also raises serious privacy concerns. Researchers have proposed locally private data-collection models to address this problem.

Many challenges remain, especially for key-value data. Consider a movie-rating system such as Douban or IMDb, where the problem can be defined as follows:
| Attribute | Description |
|---|---|
| Data Type | Each user has a different number of key-value pairs |
| Data Domain | Key in \({1,2,⋯, d}\), value in \([−1,1]\) |
| Task | Frequency and mean estimation |
| Threat Model | Honest-but-curious server |
| Objectives | Strong privacy-utility tradeoff |
The existing challenges in this problem are:
- Challenge 1: Each user has a different number of key-value pairs.
- Challenge 2: It is hard to report the corresponding value when a fake key is reported.
- Challenge 3: It is hard to design an optimal mechanism with the best privacy-utility tradeoff.
Related Work
Duchi et al. proposed LDP (Local Differential Privacy) in 2013, which can be summarized as the following formula:
\[ \frac{\operatorname{Pr}(\mathcal{M}(x)=y)}{\operatorname{Pr}\left(\mathcal{M}\left(x^{\prime}\right)=y\right)} \leqslant e^{\varepsilon} \]
- \(x\), \(x^{\prime}\): Raw data generated by the user
- \(y\): Perturbed data
- \(\epsilon\): Privacy budget
Warner proposed randomized response in 1965 for binary answers.
\[ \text { To satisfy } \epsilon \text { -LDP: } p=\frac{e^{\epsilon}}{e^{\epsilon}+1} \text { (since } \frac{p}{1-p}=e^{\epsilon} \text { ) } \]
\[ \mathbb{E}[f]=f^{*} p+\left(1-f^{*}\right)(1-p)=(2 p-1) f^{*}+(1-p) \]
Later, researchers proposed OUE and GRR, which extend randomized response to more general cases. GRR stands for generalized randomized response; the formula is shown here:
\[ \operatorname{Pr}(\mathcal{M}(x)=y)=\left\{\begin{array}{ll} p=\frac{e^{\varepsilon}}{e^{\epsilon}+d-1}, & \text { if } y=x \\ q=\frac{1-p}{d-1}, & \text { if } y \neq x \end{array}\right. \]
UE stands for unary encoding; the formula is shown here:
\[ \operatorname{Pr}(\mathbf{y}[k]=1)=\left\{\begin{array}{ll} p, & \text { if } \mathbf{x}[k]=1 \\ q, & \text { if } \mathbf{x}[k]=0 \end{array} \quad(\forall k=1,2, \cdots, d)\right. \]
Ye et al. proposed PrivKVM in 2019; the process of their framework is shown below: 
Method
- Step 1. Choose the advanced sampling protocol.
- Step 2. Jointly perturb key-value data and jointly analyze privacy. This provides tight privacy-budget composition.
- Step 3. Combine everything optimally. In other words, optimize privacy-budget allocation under a fixed total budget.

- Advanced sampling protocol: Each user pads her keys to a uniform length \(\ell\) using dummy keys.
- Joint privacy analysis: Analyze privacy end to end instead of directly using sequential composition.
- Optimized allocation of \(\epsilon_{1}\) and \(\epsilon_{2}\): Minimize estimation MSE under tight budget composition.
Evaluation
In the evaluation, the authors conduct experiments on four real-world datasets to test the framework's robustness.
On the E-Commerce Dataset

On the Clothing Dataset

On the Amazon Dataset

On the Movie Dataset

Reading Summary
| Question | Answer |
|---|---|
| What is the contribution? | 1. The authors proposed the PCKV framework and two methods: PCKV-UE, which is based on unary encoding, and PCKV-GRR, which is based on generalized randomized response. |
| 2. They extend the padding-and-sampling protocol for key-value data, which performs better than the sampling protocol used in PrivKVM on a large domain. | |
| 3. They derive the budget composition for key-value data. | |
| 4. They propose a near-optimal budget-allocation approach for PCKV-UE and PCKV-GRR. | |
| 5. They evaluate their scheme on both synthetic and real-world datasets, showing higher utility, i.e., lower MSE, than existing schemes. | |
| What are the strengths? | 1. The framework provides a better method for estimating value means, which can be computed in only one round. |
| 2. The proposed method has tighter budget composition and optimized budget allocation. | |
| 3. By using the advanced sampling protocol, it performs better than state-of-the-art methods such as PrivKVM, especially on a large domain. | |
| 4. It consumes less privacy budget overall than simply summing the budgets for key and value perturbations. | |
| 5. The framework supports theoretical analysis of the mean squared error (MSE) of frequency and mean estimation, allowing the authors to derive near-optimal privacy-budget allocation for key and value perturbations. | |
| What are the weaknesses? | It has high variance for smaller populations. |
| What issue remains? | There is still no ideal strategy for choosing \(\ell\) in the padding-and-sampling protocol. |
| Ideas for future work | 1. Extend the framework to protect machine learning (perhaps inspired by federated learning?). |
| 2. Extend correlated perturbation and tight composition analysis to multidimensional data, such as k-way marginal probability distribution estimation and conditional probability distribution estimation. | |
| 3. Explore whether this framework can inspire solutions for social graphs, linear query answering, preference ranking, and evolving data types. |