Paper Reading 4 - PCKV: Locally Differentially Private Correlated Key-Value Data Collection with Optimized Utility

One-Sentence Summary

The authors proposed a new framework that addresses several limitations of PrivKVM and achieves strong experimental results on four real-world datasets.

Background

Large technology companies collect private user data to provide better services, but this also raises serious privacy concerns. Researchers have proposed locally private data-collection models to address this problem.

Many challenges remain, especially for key-value data. Consider a movie-rating system such as Douban or IMDb, where the problem can be defined as follows:

Attribute	Description
Data Type	Each user has a different number of key-value pairs
Data Domain	Key in \({1,2,⋯, d}\), value in \([−1,1]\)
Task	Frequency and mean estimation
Threat Model	Honest-but-curious server
Objectives	Strong privacy-utility tradeoff

The existing challenges in this problem are:

Challenge 1: Each user has a different number of key-value pairs.
Challenge 2: It is hard to report the corresponding value when a fake key is reported.
Challenge 3: It is hard to design an optimal mechanism with the best privacy-utility tradeoff.

Duchi et al. proposed LDP (Local Differential Privacy) in 2013, which can be summarized as the following formula:

\[ \frac{\operatorname{Pr}(\mathcal{M}(x)=y)}{\operatorname{Pr}\left(\mathcal{M}\left(x^{\prime}\right)=y\right)} \leqslant e^{\varepsilon} \]

\(x\), \(x^{\prime}\): Raw data generated by the user
\(y\): Perturbed data
\(\epsilon\): Privacy budget

Warner proposed randomized response in 1965 for binary answers.

\[ \text { To satisfy } \epsilon \text { -LDP: } p=\frac{e^{\epsilon}}{e^{\epsilon}+1} \text { (since } \frac{p}{1-p}=e^{\epsilon} \text { ) } \]

\[ \mathbb{E}[f]=f^{*} p+\left(1-f^{*}\right)(1-p)=(2 p-1) f^{*}+(1-p) \]

Later, researchers proposed OUE and GRR, which extend randomized response to more general cases. GRR stands for generalized randomized response; the formula is shown here:

\[ \operatorname{Pr}(\mathcal{M}(x)=y)=\left\{\begin{array}{ll} p=\frac{e^{\varepsilon}}{e^{\epsilon}+d-1}, & \text { if } y=x \\ q=\frac{1-p}{d-1}, & \text { if } y \neq x \end{array}\right. \]

UE stands for unary encoding; the formula is shown here:

\[ \operatorname{Pr}(\mathbf{y}[k]=1)=\left\{\begin{array}{ll} p, & \text { if } \mathbf{x}[k]=1 \\ q, & \text { if } \mathbf{x}[k]=0 \end{array} \quad(\forall k=1,2, \cdots, d)\right. \]

Ye et al. proposed PrivKVM in 2019; the process of their framework is shown below:

Method

- Step 1. Choose the advanced sampling protocol.
- Step 2. Jointly perturb key-value data and jointly analyze privacy. This provides tight privacy-budget composition.
- Step 3. Combine everything optimally. In other words, optimize privacy-budget allocation under a fixed total budget.

Advanced sampling protocol: Each user pads her keys to a uniform length \(\ell\) using dummy keys.
Joint privacy analysis: Analyze privacy end to end instead of directly using sequential composition.
Optimized allocation of \(\epsilon_{1}\) and \(\epsilon_{2}\): Minimize estimation MSE under tight budget composition.

Evaluation

In the evaluation, the authors conduct experiments on four real-world datasets to test the framework's robustness.

On the E-Commerce Dataset

On the Clothing Dataset

On the Amazon Dataset

On the Movie Dataset

Reading Summary

Question	Answer
What is the contribution?	1. The authors proposed the PCKV framework and two methods: *PCKV-UE, which is based on unary encoding, and PCKV-GRR, which is based on generalized randomized response*.
	2. They extend the *padding-and-sampling protocol* for key-value data, which performs better than the *sampling protocol* used in PrivKVM on a large domain.
	3. They derive the budget composition for key-value data.
	4. They propose a near-optimal budget-allocation approach for *PCKV-UE* and *PCKV-GRR*.
	5. They evaluate their scheme on both synthetic and real-world datasets, showing higher utility, i.e., lower MSE, than existing schemes.
What are the strengths?	1. The framework provides a better method for estimating value means, which can be computed in only one round.
	2. The proposed method has tighter budget composition and optimized budget allocation.
	3. By using the advanced sampling protocol, it performs better than state-of-the-art methods such as PrivKVM, especially on a large domain.
	4. It consumes less privacy budget overall than simply summing the budgets for key and value perturbations.
	5. The framework supports theoretical analysis of the mean squared error (MSE) of frequency and mean estimation, allowing the authors to derive near-optimal privacy-budget allocation for key and value perturbations.
What are the weaknesses?	It has high variance for smaller populations.
What issue remains?	There is still no ideal strategy for choosing \(\ell\) in the *padding-and-sampling protocol*.
Ideas for future work	1. Extend the framework to protect machine learning (perhaps inspired by federated learning?).
	2. Extend correlated perturbation and tight composition analysis to multidimensional data, such as *k-way marginal probability distribution estimation* and *conditional probability distribution estimation*.
	3. Explore whether this framework can inspire solutions for social graphs, linear query answering, preference ranking, and evolving data types.