Use topics from a meta tag on Special Topics Provider Sites #206

Open
dmarti opened this issue Jun 23, 2023 · 15 comments

Comments

@dmarti
Contributor

dmarti commented Jun 23, 2023

Check to see if the page is from a Special Topics Provider Site (STPS), one that hosts content on many topics (such as youtube.com). If so:

  1. Do not use the hostname to train the classifier
  2. Check for a meta tag in the page head containing the section or channel name. Use the content of this meta tag to train the classifier instead
  3. If no meta tag containing the section or channel name is found, disable Topics API on this STPS page.
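The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the thread does not define a list format or a meta tag name, so `STPS_HOSTS`, `SECTION_META_NAME`, and `classifier_input` are hypothetical, and stripping a leading `www.` is a simplification.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urlparse

# Hypothetical values: the proposal names neither a list format nor a tag name.
STPS_HOSTS = {"youtube.com"}
SECTION_META_NAME = "topics-section"


class _SectionMetaFinder(HTMLParser):
    """Records the content of the first matching <meta> tag inside <head>."""

    def __init__(self) -> None:
        super().__init__()
        self.in_head = False
        self.section: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head and self.section is None:
            attr = dict(attrs)
            content = (attr.get("content") or "").strip()
            if attr.get("name") == SECTION_META_NAME and content:
                self.section = content

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def classifier_input(url: str, html: str) -> Optional[str]:
    """Return the string to classify, or None if Topics is disabled for the page."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]  # simplification, not part of the proposal
    if host not in STPS_HOSTS:
        return host  # ordinary site: classify by hostname as usual
    # Steps 1 and 2: ignore the hostname and look for the meta tag in <head>.
    finder = _SectionMetaFinder()
    finder.feed(html)
    # Step 3: no usable meta tag means Topics is disabled on this STPS page.
    return finder.section
```

A `None` result corresponds to step 3; any other result is what would be handed to the classifier in place of the hostname.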

Special Topics Provider Sites could enroll, using the existing enrollment process, specifying that they want to be part of the STPS program. The browser or an independent party could crawl the site and check that the site has at least "n" pages that are classified as at least "m" different topics before adding the site to the STPS list.

(simpler solution to achieve a large fraction of the benefits of #118 with less complexity and risk)

@jkarlin
Collaborator

jkarlin commented Jun 23, 2023

This doesn't actually address the privacy concerns from #118. Further, it singles out one site (a rather arbitrary heuristic) rather than applying equally web-wide, which doesn't seem particularly webby. Finally, because of per-caller filtering, everyone would see some benefit from this (the global selection of top topics would be more refined), but a caller would still have to observe the user on some page with that topic in order to receive it.

@dmarti
Contributor Author

dmarti commented Jun 23, 2023

I agree that it's suboptimal to treat a single site as a special case. But as long as there is no more general approach to the YouTube problem being pursued, this would be better than nothing. Possibly other very large sites that also cover all or most topics could be special cased as well.

@michaelkleber
Collaborator

I think this feature request should be interpreted as something like: "For some browser-chosen list of Special Topic Provider Sites, pages on those sites should be able to declare what Topics they are about, and those become available to everyone, as if every Topics caller had observed them. And also YouTube should be on that list." In this sense it's more like a restricted version of #1 than of #118.

I don't know that I agree with this proposal! — no idea whether YouTube would be interested in being a Special Topic Provider, no idea how we would determine what other sites should have the same special status, etc. But this version seems "tricky and subtle" rather than "impossible".
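To make the "as if every Topics caller had observed them" reading concrete, here is a deliberately simplified sketch of per-caller filtering. It ignores epochs, noise, and everything else in the real Topics API; the function name and data shapes are assumptions for illustration only.

```python
from typing import Dict, Set


def topics_visible_to(
    caller: str,
    user_top_topics: Set[str],
    observed_by: Dict[str, Set[str]],
    stps_declared: Set[str],
) -> Set[str]:
    """Return the subset of the user's top topics that `caller` may receive.

    observed_by maps each caller to the topics it has observed for this user.
    stps_declared holds topics declared by pages on Special Topic Provider
    Sites; under this reading they count as observed by every caller.
    """
    seen = observed_by.get(caller, set())
    return {t for t in user_top_topics if t in seen or t in stps_declared}
```

With an empty `stps_declared` set this reduces to plain observer filtering, which is why the proposal reads as a restricted form of lifting that filter.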

@dmarti
Contributor Author

dmarti commented Jun 24, 2023

@michaelkleber That makes a lot of sense. The list doesn't have to be browser-chosen.

  1. Special Topics Provider Sites could enroll, using the existing enrollment process and specifying that they want to be part of the STPS program.
  2. The browser or an independent party could crawl the site and check that the site has at least "n" pages that are classified as at least "m" different topics
  3. If the number of pages and topics is high enough, the site is added to the STPS list.
  4. Pages from sites on the STPS list are classified by the content of the appropriate meta tag, not domain.
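The crawl check in steps 2 and 3 could look something like the sketch below. It assumes one reading of the thresholds: at least "n" crawled pages receive a classification, and those classifications span at least "m" distinct topics. The function name and input shape are hypothetical.

```python
from typing import Iterable, Optional


def is_stps_eligible(page_topics: Iterable[Optional[str]], n: int, m: int) -> bool:
    """Decide whether a crawled site qualifies for the STPS list.

    page_topics holds one classified topic per crawled page, or None where
    the classifier produced no topic for that page.
    """
    classified = [t for t in page_topics if t is not None]
    return len(classified) >= n and len(set(classified)) >= m
```

For example, `is_stps_eligible(["Autos", "Cooking", "Autos", None], n=3, m=2)` passes, while a site whose classified pages all land on a single topic does not, however many pages it has.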

@dmarti changed the title from "Handle youtube.com channels as Topics API data sources" to "Use topics from a meta tag on Special Topics Provider Sites" on Jun 24, 2023
@dmarti
Contributor Author

dmarti commented Jun 26, 2023

I have rewritten the text of this issue to cover Special Topics Provider Sites, as @michaelkleber suggested. This seems like a possible path forward considering that #118 was closed, and that there still appears to be interest in fairly classifying content from large, multi-topic sites. See p. 7 of the CMA update report on implementation of the Privacy Sandbox commitments, April 2023.

@jkarlin
Collaborator

jkarlin commented Jun 26, 2023

I think you can achieve the same effect with a default () permission policy that declares that the page would like to include something other than domain in its topics rather than needing to make changes to enrollment.

@michaelkleber
Collaborator

Don, I see you're still hoping that the browser does the work of turning the "section or channel name" into topics, rather than letting the STPS just declare the page's topics directly. Is that distinction important to you?

It seems to me that the way to turn a YouTube channel name into a Topic could be very different from how you turn a hostname into a Topic. So it feels like this version of the proposal implicitly asks browsers to build a specialized STPS-to-Topics model for each Provider Site.

On the one hand, that seems like putting the work in the wrong place: Surely the site is in a good position to do a better job! On the other hand, you might worry that an STPS would be able to abuse this by maliciously giving out the wrong topics — but if you're letting them control the "section or channel name" input and the model is public, then surely it would be easy for them to maliciously push false topics either way.

@dmarti
Contributor Author

dmarti commented Jun 26, 2023

Hi @michaelkleber -- I don't know. On one hand, it seems like the choice of whether or not to allow sites or channels to choose their own topics should apply to both sites and channels or to neither. Some hostnames provide usable Topics API information to the classifier, and others don't. Some YouTube channel names provide usable information to the classifier, and others don't. (For example, Jalopnik dot com is about cars, but it's a made-up word so it doesn't get classified, last I checked. And the YouTube channel "LazerPig" is not about lasers or pigs. Other site and channel names have better keywords in them.) You might be able to use the same classifier for hostnames and channels/sections if STPSs had to transform the channel name into something that would be a valid hostname ("My YouTube Channel" becomes "my-youtube-channel" or similar).

On the other hand, there are relatively few STPSs and it would be fairly straightforward to spot-check how accurately they were assigning topics to each channel, so it might be fine to have STPSs pass topics directly.

@jkarlin Yes, that seems to be another workable option.
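For the channel-name transformation mentioned above ("My YouTube Channel" becomes "my-youtube-channel"), a minimal sketch might be the following. The function name is hypothetical, and dropping non-ASCII characters is a simplifying assumption; a real scheme would need a policy for non-Latin channel names.

```python
import re
import unicodedata


def channel_to_label(name: str) -> str:
    """Turn a channel or section name into a hostname-style label.

    Keeps only a-z, 0-9 and hyphens, with no leading or trailing hyphen,
    and caps the result at 63 characters (the DNS label length limit).
    """
    # Decompose accented characters, then drop anything outside ASCII.
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    # Collapse every run of other characters into a single hyphen.
    label = re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")
    return label[:63].rstrip("-")
```

An empty result (for example, from a name with no ASCII letters or digits) would have to be treated as "no usable name", much like a missing meta tag.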

@michaelkleber
Collaborator

> On one hand, it seems like the choice of whether or not to allow sites or channels to choose their own topics should apply to both sites and channels or to neither.

Hmm, the two questions feel quite different to me. Changing a domain name is both much harder and much more user-visible than changing an invisible meta tag on a page, for example. Using something user-visible seems like a huge contributor to maintaining quality of input data.

But a lot of this comes around to the question of what qualifications a site would need to have to be an STPS. Besides just being large and heterogeneous, if we think it would include a site being more "reputable" in some way, then perhaps that reputation would lead us to expect a lower chance of being pushed useless/fabricated topics. (OTOH would you let Reddit onto the list? Seems all-but-guaranteed that some subreddits would claim a random absurd topic for each pageview.)

@dmarti
Contributor Author

dmarti commented Jun 26, 2023

@michaelkleber Yes, I agree about the Reddit problem (one of the current best international news subreddits has a deliberately embarrassing and NSFW name in an effort to avoid ads, and they would probably pass the most embarrassing possible topics too). But there are few enough STPSs that the browser (or other STPS list maintainer) could check the privacy policy for whether it covers passing best-effort accurate topics or something else, and spot-check what the site is actually passing.

Some sites that are eligible to be STPSs will probably not see a reason to do it until some other party offers them an incentive to more accurately classify their audiences. In that case the other party will be in a position to require and check that the STPS is passing accurate topics, and the browser won't need to enforce.

@michaelkleber
Collaborator

> the browser (or other STPS list maintainer) could check the privacy policy for whether it covers passing best-effort accurate topics or something else, and spot-check what the site is actually passing.

This strikes me as very unappealing, and we should do whatever we can to avoid ending up in that position.

@dmarti
Contributor Author

dmarti commented Jun 26, 2023

Yes, but it's less unappealing the fewer privacy policies you have to read. The number of pages and topics required for STPS status can be set high enough to keep the work on the browser (or independent evaluator) easily manageable, and not all sites eligible for STPS will apply.

@jkarlin
Collaborator

jkarlin commented Jun 28, 2023

If we were to go in the direction of allowing metadata, then it might make sense to do so in a page-level opt-in way to address privacy concerns. My primary concern there is that I imagine very few pages would opt in, as it's unclear what their incentive would be. And without a significant user base, it's hard to justify the costs of training the new model and having it sit on users' devices.

@dmarti
Contributor Author

dmarti commented Jun 28, 2023

Hi @jkarlin, yes, that's a good point. There are at least two scenarios in which a large, multi-topic site will choose page-level opt-in or STPS.

  • Competition regulators require page-level opt-in and/or STPS when a company owns both a Topics API browser and a large, general-interest site that would otherwise benefit in an illegal or questionable way from domain-based Topics API training

  • Adtech intermediaries compensate large, general-interest publishers for providing additional data that they can use to increase ad revenue on other sites (in this case the intermediary is motivated to check on the site's topics, so there would be little administrative burden on the browser maintainers)

The first scenario is the one that seems to be the immediate problem. I know that either opt-ins or STPS would represent additional development work. But realistically, given how little time browser development tasks take compared to regulator and lawyer meetings, it seems to me that it's worth the additional time to implement Topics API in a way that takes some meaningful steps toward treating niche sites and YouTube channels in a comparable way.

@dmarti
Contributor Author

dmarti commented Jul 24, 2023

Added #224 to cover the opt-in suggested by @jkarlin


3 participants