Callers getting topics according to a priority list #42

Open
lbdvt opened this issue Feb 2, 2022 · 7 comments

Comments

@lbdvt
Contributor

lbdvt commented Feb 2, 2022

A caller may not get the same signal from every topic when selecting an ad; for instance, "Auto insurance" may be more useful than "Vegan Cuisine".

Would it be possible for callers to provide a ranked priority list of topics, for example at a .well-known location, and for the API to return topics, if eligible, according to this priority list?

@jkarlin
Collaborator

jkarlin commented Feb 2, 2022

I like the idea in spirit, but in practice it runs up against a privacy concern: if different callers on a site receive different topics, those callers can talk to each other and quickly learn far more topics per week for a user than intended.

You could then imagine that the first caller on the site for the week gets to set a preference, and the other callers on the page are stuck with the first caller's preference. But that doesn't seem fair either. So the plan is to choose randomly.
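
The concern can be illustrated with a minimal sketch. Everything here is hypothetical: `topic_by_priority` models the proposed per-caller priority list, and `topic_random_per_site` models one random pick shared by all callers on a site for the week. The seeded RNG only stands in for however the browser would keep that pick stable; it is not the actual mechanism.

```python
import random

def topic_by_priority(top_topics, priority):
    """Hypothetical per-caller selection: first eligible topic in the caller's list."""
    for topic in priority:
        if topic in top_topics:
            return topic
    return None

def topic_random_per_site(top_topics, site, epoch):
    """One random pick per (site, epoch), shared by every caller on that site."""
    rng = random.Random(f"{site}|{epoch}")  # stand-in for a stable per-site choice
    return rng.choice(sorted(top_topics))

# Hypothetical top topics for one user in one epoch.
top = {"Auto Insurance", "Vegan Cuisine", "Online Video", "Travel", "Sports"}

# Three callers on the same site, each with a different priority list.
priorities = [
    ["Auto Insurance", "Travel"],
    ["Vegan Cuisine", "Sports"],
    ["Travel", "Online Video"],
]

# With priority lists, colluding callers pool what they each received.
learned_with_priorities = {topic_by_priority(top, p) for p in priorities}

# With a shared random pick, every caller sees the same single topic.
learned_with_shared_pick = {topic_random_per_site(top, "example.com", 1) for _ in priorities}
```

In this toy setup the three colluding callers learn three distinct topics under priority lists, but only one under the shared pick.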

@lbdvt
Contributor Author

lbdvt commented Feb 3, 2022

More generally, I'm worried about the signal that can be gained from the topics, and how useful it can be for "advertising based on generic interests".

If, for instance, YouTube, Google, and Facebook call the Topics API on their pages, a very significant portion of users may have "Online Video", "Search Engines", and "Social Network" in their top 5 topics, which I don't see as very helpful for advertising.

What are your thoughts on this?

@jkarlin
Collaborator

jkarlin commented Feb 3, 2022

I think we ought to explore this issue. As a simple idea, we could weight topics by overall frequency on the web (e.g., find the topics of pages in the HTTP Archive and weight topics inversely by frequency). This would help to overcome the issue you've described.
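
A rough sketch of what inverse-frequency weighting might look like. The counts, the log form of the weight, and the function names are all assumptions made for illustration; nothing here is a specified part of the API.

```python
import math
from collections import Counter

def inverse_frequency_weights(web_topic_counts):
    """Map each topic to log(total / count), so web-wide common topics get low weight."""
    total = sum(web_topic_counts.values())
    return {t: math.log(total / c) for t, c in web_topic_counts.items() if c > 0}

def top_topics(user_topic_counts, weights, k=5):
    """Rank a user's topics by visit count times inverse-frequency weight."""
    scored = {t: n * weights.get(t, 0.0) for t, n in user_topic_counts.items()}
    return sorted(scored, key=lambda t: (-scored[t], t))[:k]

# Hypothetical web-wide counts, e.g. from classifying pages in a public crawl.
web_counts = Counter({"Search Engines": 9000, "Online Video": 8000,
                      "Auto Insurance": 50, "Vegan Cuisine": 40})
weights = inverse_frequency_weights(web_counts)

# Hypothetical user: many visits to common topics, a few to a rare one.
user_counts = {"Search Engines": 30, "Online Video": 25, "Auto Insurance": 4}
ranked = top_topics(user_counts, weights, k=2)
```

With these made-up numbers, the four "Auto Insurance" visits outrank thirty "Search Engines" visits, which is the effect the weighting is meant to have.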

There are other concerns that I have as well in picking the top topics. For instance, let's say the user frequently visits pages about two different sports, but neither individual sport has enough visits to make it a top 5 topic for the week. Combined, though, they would. Should the parent in the hierarchy, sports, then be chosen?
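
The two-sports case can be made concrete with a small sketch. The `PARENT` mapping, the topic names, and the visit counts are hypothetical, and the example uses a top 2 instead of a top 5 to keep it short.

```python
from collections import Counter

# Hypothetical child -> parent mapping; the real taxonomy is much larger.
PARENT = {"Sports/Soccer": "Sports", "Sports/Tennis": "Sports"}

def with_parent_rollup(counts):
    """Return counts where each child's visits are also credited to its parent."""
    rolled = Counter(counts)
    for topic, n in counts.items():
        parent = PARENT.get(topic)
        if parent is not None:
            rolled[parent] += n
    return rolled

def top_k(counts, k):
    """Topics sorted by descending count, ties broken alphabetically."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [topic for topic, _ in ordered[:k]]

visits = {"News": 5, "Travel": 5, "Sports/Soccer": 3, "Sports/Tennis": 3}
without_rollup = top_k(visits, 2)
with_rollup = top_k(with_parent_rollup(visits), 2)
```

Without the rollup neither sport makes the cut; with it, the parent "Sports" has six visits and comes out on top. Whether that is the desired outcome is exactly the open question.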

@stguav

stguav commented Feb 4, 2022

Since #46 was merged, I'm restating some points from it here. Let me request that we clarify the current proposal on how topics are ranked, or make the uncertainty more explicit.

The main questions raised there that are not explicitly covered here:

  • how will a user's repeated visits (within the same epoch) to the same site be treated?
  • can the classifier have (non-unit) weight output? This would allow for more nuance.
  • how will topics' weights (unit or otherwise) be added up?

Regarding the taxonomy hierarchy, one convenient way of handling it:

  • whenever a child node is present, include all parent nodes (perhaps with lower weight, see suggestion about TF-IDF, similar to your above comment).
  • this also ensures that if a caller has a more granular topic, then the caller also has access to the broader ones.
  • the downside is we may end up with users' top 5 topics being very redundant, all having parent-child relationships. (Even without this treatment, this is a concern.) One could consider some crowding/deduping logic (perhaps with noise?). Having a more diverse top 5 should improve the average utility across many calls for many users.
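
A minimal sketch of this hierarchy handling, under several assumptions: topics are written as "/"-separated paths, each ancestor receives the child's weight multiplied by a hypothetical `parent_discount` per level, and the deduping is a simple greedy pass that skips any topic in an ancestor/descendant relation with one already picked. The noise idea is not modelled.

```python
def ancestors(topic):
    """Yield the ancestors of a '/'-separated topic path, nearest first."""
    parts = topic.split("/")
    for i in range(len(parts) - 1, 0, -1):
        yield "/".join(parts[:i])

def score_with_ancestors(counts, parent_discount=0.5):
    """Credit each topic with its count and each ancestor with a discounted share."""
    scores = {}
    for topic, n in counts.items():
        scores[topic] = scores.get(topic, 0.0) + n
        weight = float(n)
        for anc in ancestors(topic):
            weight *= parent_discount
            scores[anc] = scores.get(anc, 0.0) + weight
    return scores

def related(a, b):
    """True if the topics are equal or one is an ancestor of the other."""
    return a == b or a.startswith(b + "/") or b.startswith(a + "/")

def diverse_top_k(scores, k=5):
    """Greedy pick by score, skipping topics related to an earlier pick."""
    picked = []
    for topic in sorted(scores, key=lambda t: (-scores[t], t)):
        if not any(related(topic, p) for p in picked):
            picked.append(topic)
        if len(picked) == k:
            break
    return picked

counts = {"Sports/Soccer": 4, "Sports/Tennis": 3, "Autos": 2}
scores = score_with_ancestors(counts)
picks = diverse_top_k(scores, k=3)
```

Here "Sports" scores 3.5 and would take the second slot in a plain top 3, but the greedy pass drops it because "Sports/Soccer" is already picked, leaving room for "Autos".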

@jkarlin
Collaborator

jkarlin commented Feb 9, 2022

I agree that keeping hierarchy in mind is likely to trend toward higher-level items, which is a concern. I think the TF-IDF approach has potential. Basically, we'd want to measure the inverse frequency of topics (as opposed to documents) based on user traffic. This does require knowledge of how often users visit various sites and what their topics are. Topics can be derived via the Topics API model. But the traffic data would likely need to come from Chrome's data, which isn't public. That is, unless someone is aware of a good public dataset? I'll look into what can be done. On the bright side, the resulting list of weights would be small (~350) and each topic would be represented by a large number of users. So I think we'd have some pretty solid differential privacy properties with a little bit of noise.
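
As a rough illustration only: one way to add "a little bit of noise" is Laplace noise on per-topic user counts before taking the inverse frequency. The function name, the `epsilon` parameter, and the user counts are assumptions. The sketch also treats each topic's count independently and assumes a user changes any one count by at most 1, which a real privacy analysis would need to revisit since one user contributes to several topics.

```python
import math
import random

def noisy_topic_weights(users_per_topic, total_users, epsilon=1.0, rng=None):
    """Inverse-frequency weight per topic, computed from Laplace-noised user counts."""
    rng = rng or random.Random()
    scale = 1.0 / epsilon  # Laplace scale, assuming per-count sensitivity of 1
    weights = {}
    for topic, count in users_per_topic.items():
        # The difference of two exponentials with mean `scale` is Laplace(0, scale).
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        # Clamp so the log is defined and the weight is never negative.
        noisy = min(max(count + noise, 1.0), float(total_users))
        weights[topic] = math.log(total_users / noisy)
    return weights

# Hypothetical numbers of users with each topic in their weekly history.
users_per_topic = {"Search Engines": 900_000, "Auto Insurance": 1_000}
noisy_weights = noisy_topic_weights(users_per_topic, total_users=1_000_000,
                                    epsilon=1.0, rng=random.Random(0))
```

Because each topic is backed by many users, noise of this size barely moves the weights, which matches the intuition in the comment above.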

@jkarlin jkarlin mentioned this issue May 23, 2022
@jkarlin
Collaborator

jkarlin commented Aug 19, 2022

I’d like to stick to the topic of initial weighting for each topic based on its value before we go into hierarchical concerns, repeated topics, etc. Those seem like optimizations that should come after we have a better idea of what a topic is actually worth.

So far we’ve discussed using inverse frequency of topics as a proxy for value, but I’d like to see if we can get a more direct idea of commercial value first. Perhaps the IAB Tech Lab can help us out here.

Hey IAB Tech Lab! (@angelinaeng, @bjd326) We're pretty sure that there is room for improvement in how the Topics API weighs the user’s top 5 topics. We'd like to utilize a notion of topic value that represents the opinions of a large body of the industry. Do you have (or might you be interested in creating) some sort of indication of value for each of the topics in your content taxonomy that we could then apply to Topics as well? Even something as simple as a 1-to-5 scale of commercial utility could be a useful foundation. Perhaps a discussion we could have in an upcoming IAB meeting.

@jkarlin
Collaborator

jkarlin commented Nov 17, 2022

Another option to get at a notion of commercial value is to use Chrome data. Chrome can determine (in many cases) when a user navigates due to an ad click. We could look at (in an aggregated, differentially private form) the topics of the ad landing pages, and note their frequency. More common landing page topics would be deemed more commercially valuable. Obviously this is imperfect (it excludes topics that may gravitate toward brand ads that don't necessarily require clicks, Chrome's heuristics miss some ads, infrequent ad categories could have huge value), but I'm confident that it would be closer to real commercial value than simple inverse topic frequency across all pages.
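
A toy sketch of the landing-page idea, assuming the input is a list of topic lists, one per ad-click landing page. The `min_count` threshold is a hypothetical stand-in for aggregation and is not differential privacy; the data is invented.

```python
from collections import Counter

def commercial_value_weights(landing_page_topics, min_count=2):
    """Share of landing-page topic occurrences per topic, dropping rare topics."""
    counts = Counter(t for topics in landing_page_topics for t in topics)
    kept = {t: c for t, c in counts.items() if c >= min_count}
    total = sum(kept.values())
    return {t: c / total for t, c in kept.items()} if total else {}

# Hypothetical topics of pages reached through ad clicks.
landing_pages = [
    ["Auto Insurance"],
    ["Auto Insurance", "Autos"],
    ["Autos"],
    ["Vegan Cuisine"],
    ["Auto Insurance"],
]
value_weights = commercial_value_weights(landing_pages)
```

The sketch also shows one of the caveats listed above: "Vegan Cuisine" appears only once and is dropped entirely, even though an infrequent ad category could be very valuable.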
