Paper Reading 4 - PCKV: Locally Differentially Private Correlated Key-Value Data Collection with Optimized Utility

One sentence summary

The author proposed a new framework to solving some limitations in PrivKVM, achieving good experiment results on 4 real world dataset.

Background

Some giant tech-companies are collecting our private data to provide better services, but privacy concerns arise in the present era. Some researcher proposed locally private data collection model to solve this problem.

Especially for Key-Value data, there exist many challengs in Key-Value data. Here give a scenario which is the movie rating system (such as Douban and IMDB), and the problem define is:

Attribute	Description
Data Type	Each user has different number of KV pairs
Data Domain	Key in \({1,2,⋯, d}\), value in \([−1,1]\)
Task	Frequency and mean estimation
Threat Model	Honest-but-curious server
Objectives	Good privacy-utility tradeoff

The existing challenges in this problem are:

Challenges 1: Each user has different number of key-value pairs.
Challenges 2: It's hard to report the corresponding value if a fake key is reported.
Challenges 3: It's hard to design an optimal mechanism with the best privacy-utility tradeoff.

Duchi et al. proposed LDP(Local Differential Privacy) in 2013, this can be conclude as the following formula:

\[ \frac{\operatorname{Pr}(\mathcal{M}(x)=y)}{\operatorname{Pr}\left(\mathcal{M}\left(x^{\prime}\right)=y\right)} \leqslant e^{\varepsilon} \]

\(x\), \(x^{\prime}\) : Stand for the raw data generate by user
\(y\) : The perturbed data
\(\epsilon\) : Privacy budget

Warner proposed randomized response in 1965 for binary answer.

\[ \text { To satisfy } \epsilon \text { -LDP: } p=\frac{e^{\epsilon}}{e^{\epsilon}+1} \text { (since } \frac{p}{1-p}=e^{\epsilon} \text { ) } \]

\[ \mathbb{E}[f]=f^{*} p+\left(1-f^{*}\right)(1-p)=(2 p-1) f^{*}+(1-p) \]

After that, many other researchers have proposed OUE, GRR which are extended RR for general cases. GRR stand for generalized randomized response, the formula is here:

\[ \operatorname{Pr}(\mathcal{M}(x)=y)=\left\{\begin{array}{ll} p=\frac{e^{\varepsilon}}{e^{\epsilon}+d-1}, & \text { if } y=x \\ q=\frac{1-p}{d-1}, & \text { if } y \neq x \end{array}\right. \]

UE stand for Unary Encoding, the formula is here:

\[ \operatorname{Pr}(\mathbf{y}[k]=1)=\left\{\begin{array}{ll} p, & \text { if } \mathbf{x}[k]=1 \\ q, & \text { if } \mathbf{x}[k]=0 \end{array} \quad(\forall k=1,2, \cdots, d)\right. \]

Ye at al. proposed PrivKVM in 2019, the process of their framework is:

Method

- Step 1. Choose the advanced sampling protocol
- Step 2. Jointly perturb key-value and jointly analyze the privacy (which provides tight privacy budget composition)
- Step 3. Optimally put all things together (i.e., optimized privacy budget allocation under a fixed total budget)

Advanced sampling protocol: Each user pads her keys into a uniform length \(\ell\) by some dummy keys.
Joint privacy analysis: In an end-to-end way (instead of directly using sequential composition).
Optimized allocation of \(\epsilon_{1}\) and \(\epsilon_{2}\): by minimizing MSE of estimation under tight budget composition.

Evaluation

The evaluation part author conduct experiment on 4 real world dataset to test its robust

On E-commerce dataset

On Clothing dataset

On Amazon Dataset

On Movie Dataset

Reading summary

Question	Answer
What is the contribution	1. The author proposed PCKV framework，and simultanously proposed two methods: one is called *PCKV-UE, which based on unary encoding, the other called PCKV-GRR，which based on generalized randomized response*.
	2. Extend *padding-and-sampling protocol* for key-value data. Which has been verified that it is better than *sampling protocol* used in PrivKVM on large domain.
	3. Show the budget composition in key-value.
	4. The author proposed a near-optimal budget allocation approach for *PCKV-UE* and *PCKV-GRR*.
	5. Evaluate their scheme using both synthetic and real world datasets, which is shown to have higher utility (i.e., less MSE) than existing schemes
What is the strengths	1. Propose better mean of values estimate method, which can be calculated in only one round.
	2. The author's proposed method has tighter budget composition and optimized budget allocation.
	3. Use advanced sampling protocol, perform better than SOTA (e.g. PrivKVM) especially on large domain.
	4. Consumes less privacy budget overall than the budget summation of key and value perturbations.
	5. In this framework, the Mean Square Error (MSE) of frequency and mean estimations in their scheme can be theoretically analyzed, they can derive near-optimal privacy budget allocation on key and value perturbations.
What is the weaknesses	It has big variance for a smaller population.
What is the existing issue	There is no ideal strategy of choosing \(\ell\) found in *Padding-and-Sampling protocol* yet.
Insight of future work	1. Extend the framework to protect machine learning (maybe inspire on federal learning?).
	2. Extend the correlated perturbation and tight composition analysis to multi-dimensional data (such as *k-Way Marginal Probability Distribution Estimation* and *Conditional Probability Distribution Estimation*).
	3. Can this framework inspire on solving problem on social graphs, linear query answer, preference ranking and evolving data type?