
Request for Assistance in Replicating Re-Identification Risk Experiment #267

Open
yashmaurya01 opened this issue Oct 30, 2023 · 7 comments

@yashmaurya01

Hello,

I'm attempting to replicate the re-identification risk experiment detailed in the paper "Measuring Re-identification Risk." However, I'm encountering difficulties in accessing the Million Song Dataset, which was used in the empirical analysis. Unfortunately, the Echonest website appears to be down, preventing me from obtaining the necessary API key to access the dataset.

I would greatly appreciate any guidance on how to obtain the Million Song Dataset and replicate the experiment's results. Additionally, I'm seeking information on attribute mappings for the MSD dataset, specifically to simulate a scenario similar to the Topics API, which requires data such as browser history and the frequency of visits within a week for topic calculations.

Thank you for your assistance.

Best regards,
Yash Maurya

@aleepasto

Thank you for your interest in our paper "Measuring Re-identification Risk." We are delighted to see interest from the research community in replicating our work, and we are happy to assist you.

First, we would like to clarify that the MSD dataset was used in the paper exclusively to allow academic researchers to experiment with a public dataset using our open-source code. For this reason, we used a public dataset that has been part of many academic papers in the past. We did not, however, intend the MSD dataset to be considered similar or related to the Topics API, since the dataset is not based on browsing histories or Topics API outputs.

We refer you to Section 8.5 of the paper (Measuring Re-identification Risk), where we discuss specifically how we generated samples from the MSD dataset to test the probability of correctly matching a sample based on the song IDs. Note that this data generation process is not similar to the Topics API sampling method, and the results on this dataset do not have implications for the re-identification risk of the Topics API.

Concerning the data, the dataset appears to still be available at this repository: http://millionsongdataset.com/tasteprofile/

@svijayakumar2

I think the issue is with the API key. We can't access the user data without a key, but since the MSD changed ownership, keys no longer seem to be publicly available. Do you know how to work around this problem, or can you confirm this is the case?

@aleepasto

Hi,
Thanks for the question. The specific dataset we used appears to be available at this link (without requiring a key): http://millionsongdataset.com/sites/default/files/challenge/train_triplets.txt.zip

Please let me know if you have any other questions.
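For anyone following along: the Taste Profile file at that URL is, to my understanding, a tab-separated list of (user ID, song ID, play count) triplets. A minimal sketch of loading it into per-user song sets (the column order and the toy IDs below are assumptions for illustration, not taken from the paper's code):

```python
from collections import defaultdict

def load_triplets(lines):
    """Parse tab-separated (user_id, song_id, play_count) triplets
    into a mapping from each user to the set of songs they played."""
    user_songs = defaultdict(set)
    for line in lines:
        user_id, song_id, _play_count = line.rstrip("\n").split("\t")
        user_songs[user_id].add(song_id)
    return dict(user_songs)

# Tiny illustrative input with hypothetical IDs, same shape as the real file.
sample = [
    "u1\tSOAAA\t3",
    "u1\tSOBBB\t1",
    "u2\tSOAAA\t7",
]
print(load_triplets(sample))
```

In practice you would stream the unzipped `train_triplets.txt` line by line rather than load it all at once, since the full file is large.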

@suriya-ganesh

Hi @aleepasto ,
In the file you linked, the first and second columns seem to be some sort of ID. Were the experiments run over the IDs, or were the IDs decoded into their values?
Thanks

@aleepasto

Hi, we use the song IDs associated with a given user in the dataset, without attaching any meaning to the IDs. As reported in Section 8.5 of the paper, we simulate a system that outputs a sample of r songs for each user, independently, to generate two different databases. Then, we measure the match rate across the two databases for a fixed r.
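To make the described procedure concrete, here is a minimal sketch (not the paper's code) of the experiment: each user's simulated system output is an independent random sample of r song IDs, drawn twice to form two databases, and a user counts as matched when their sample in one database has strictly the largest overlap with their own record in the other. The overlap-based matching rule is an illustrative assumption on my part.

```python
import random

def sample_db(user_songs, r, rng):
    """One simulated system output: an independent random sample of
    r song IDs per user (fewer if the user has fewer than r songs)."""
    return {u: set(rng.sample(sorted(songs), min(r, len(songs))))
            for u, songs in user_songs.items()}

def match_rate(db_a, db_b):
    """Fraction of users in db_a whose sample's best-overlap record in
    db_b is their own (ties count as a miss). The overlap matcher is an
    assumption for illustration, not necessarily the paper's matcher."""
    hits = 0
    for u, sample_a in db_a.items():
        scores = {v: len(sample_a & sample_b) for v, sample_b in db_b.items()}
        best = max(scores.values())
        winners = [v for v, s in scores.items() if s == best]
        hits += winners == [u]
    return hits / len(db_a)

rng = random.Random(0)
# Hypothetical users with mostly disjoint song sets plus one shared song.
users = {f"u{i}": {f"s{i}_{j}" for j in range(20)} | {"shared"}
         for i in range(5)}
db1, db2 = sample_db(users, 5, rng), sample_db(users, 5, rng)
print(match_rate(db1, db2))
```

As a sanity check, when r covers every song the match rate is 1.0, since each user's full song set is unique here; for small r the rate drops because independent samples may barely overlap.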

@AmanPriyanshu

Hi, this discussion is really interesting. I just wanted to clarify something about the Million Song implementation. Ideally, the Topics API's re-identification risk is going to be based on the attack model's ability to understand user behavior, and such attacks would strongly depend on the somewhat deterministic nature of frequency counts for topics every epoch/week.

However, I was confused about why r random songs were chosen for these users instead of applying the same frequency counting. Won't the randomness prevent any patterns from forming?

@aleepasto

Thanks for the comment. Given the fundamental differences between the MSD dataset and the real Topics API implementation, we did not intend to use the MSD dataset to model the Topics API in any way. For this reason, we did not attempt to mimic any part of the API's behavior (e.g., fixing the top k = 5 songs per user). We only included the dataset in the paper to allow external researchers to validate the theoretical and empirical modeling of our paper in a different context.

I hope this answers your question.
