Enable a site to set an optional section name #17

Open
dmarti opened this issue Jan 26, 2022 · 11 comments · May be fixed by #8

Comments

@dmarti
Contributor

dmarti commented Jan 26, 2022

Allow callers to specify a section name that the classifier can use to develop a topics list, to improve personalization for users of large, multi-topic sites. Callers could populate the section name in Topics API calls from the schema.org articleSection property, which is already in use.

If the topic list is per-hostname, a user of a large general-interest site may receive inadequate personalization compared to a user of multiple niche sites with only a few topics per site.

A section can be any subdivision of a site, including a "channel," "group," or "space."

This is separate from the question of allowing publishers to specify individual topics. The publisher-provided "section" is just an identifier applied to a subset of pages on that site, and the actual topics for pages in that section would still have to be determined by the classifier.

@jkarlin
Collaborator

jkarlin commented Jan 26, 2022

Thanks for opening the issue. What is the browser supposed to do differently when identifying topics in a section as compared to the hostname? If a section is available should it also include the full URL or page content or something?

@dmarti
Contributor Author

dmarti commented Jan 26, 2022

The only thing different would be to group the topics for those pages under hostname+section instead of just hostname.

The section should not need any other content besides what the classifier normally uses -- it's just a site-provided "hint" that this sub-division of the site may have more informative topics if it is treated as its own section.

See #2 -- if a whole site gets miscategorized because the classifier is "confused" by multiple topics on the same domain, the maintainer of the site can use sections to split off groups of pages that should be treated separately by the classifier, and get more accurate topics identified on the remaining pages.
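
A minimal sketch of the grouping described in this comment, assuming the browser records each page visit together with an optional site-supplied section name. `PageVisit`, `grouping_key`, and `group_visits` are hypothetical names used only for illustration; they are not part of the Topics API proposal. The section is treated as an opaque identifier and is not itself an input to classification.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional, Tuple
from urllib.parse import urlsplit


class PageVisit(NamedTuple):
    """Hypothetical record of one page visit."""
    url: str
    section: Optional[str] = None  # site-provided hint; None if the site sets none


def grouping_key(visit: PageVisit) -> Tuple[Optional[str], Optional[str]]:
    """Group by (hostname, section) instead of hostname alone."""
    return (urlsplit(visit.url).hostname, visit.section)


def group_visits(visits: List[PageVisit]) -> Dict[Tuple[Optional[str], Optional[str]], List[str]]:
    """Collect visited URLs under their (hostname, section) key."""
    groups: Dict[Tuple[Optional[str], Optional[str]], List[str]] = defaultdict(list)
    for visit in visits:
        groups[grouping_key(visit)].append(visit.url)
    return dict(groups)
```

Pages without a section fall under `(hostname, None)`, which corresponds to the per-hostname grouping used when no section is supplied.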

@jkarlin
Collaborator

jkarlin commented Jan 26, 2022

But all that the classifier uses as input is the hostname. There is no page-level distinction on a site. e.g., https://example.com/a/ and https://example.com/b/ will have the same topics since they have the same hostname. Sections won't change anything.

@pugzor

pugzor commented Jan 27, 2022

Not specific to a section name but in the same line of thought; it'd be interesting to see if various lines of intent can be determined by Topics.

For example, I've been in financial services for some time, where the vast majority of visitors to a website can be for service-based activities. I can foresee that some users may be assigned a 'Financial services' type Topic if they're casual users of the internet who mostly rely on mobile apps for most of their day-to-day needs, without a genuine interest in the area. Certain parts of a website should be 'excluded' from forming how a Topic is calculated, otherwise advertisers are going to find Topics useless for certain industries which have a high service-based component. Not always, but sometimes.

@dmarti
Contributor Author

dmarti commented Jan 27, 2022

@jkarlin Yes, the classifier would need to use the section in addition to the hostname. (It can't make assumptions based on the first pathname component, but it can use the section because that is supplied by the site.)

@pugzor Yes, another example is that a new site with just some basic info, a signup form, and a (long) privacy policy could end up being classified under a bunch of boring privacy law topics instead of the true topic. Putting the legal docs in their own section would make the site's top-level topics better reflect the text from the homepage.

@igrigorik

Whatever topic is returned, will continue to be returned for any caller on that site for the remainder of the three weeks. When a site provides a section name, results will be the same across the entire site, not just within a section. (s)

Effectively, sections are custom cluster names and we need a way to differentiate clusters. The downside, as @jkarlin pointed out in (#8 (comment)), is that custom names expose new state. Ultimately though, these clusters are discarded in favor of predefined topics... we could skip the intermediate step, I think?

Different strategy, ~same outcome, PTAL: Site-seeded topics

The gist is that we can directly assign pages against predefined clusters (the topics themselves). This also allows sites to have some input into output of the classifier.

@dmarti
Contributor Author

dmarti commented Feb 11, 2022

A large site might have multiple contributors (such as columnists or videographers) and not know in advance which contributor is planning to cover which topics.

In that case, a site could assign a section name based on the contributor name, or column or channel title, and let the classifier figure out the right topics (or, if there are not enough pages in that section to classify, use the top-level topics for the site).
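
The fallback described above could be sketched roughly as follows. This is an illustration under stated assumptions, not a proposed implementation: `MIN_PAGES_PER_SECTION` is a made-up threshold (the thread does not specify one), and `topics_for` plus the two lookup tables are hypothetical stand-ins for whatever the classifier would actually maintain.

```python
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, Optional[str]]  # (hostname, section); section None = whole site

# Assumed threshold for illustration only; the thread gives no number.
MIN_PAGES_PER_SECTION = 5


def topics_for(
    hostname: str,
    section: Optional[str],
    page_counts: Dict[Key, int],
    topics: Dict[Key, List[str]],
) -> List[str]:
    """Return section-level topics when the section has enough pages,
    otherwise fall back to the site's top-level topics."""
    key = (hostname, section)
    has_enough_pages = page_counts.get(key, 0) >= MIN_PAGES_PER_SECTION
    if section is not None and has_enough_pages and key in topics:
        return topics[key]
    return topics.get((hostname, None), [])
```

A section named after a new columnist with only two pages would, under this sketch, keep inheriting the site's top-level topics until enough pages exist to classify it separately.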

It may turn out that large sites with many topics would need to use both site-seeded topics (#50) and sections for the exchange of information to be fair enough to incentivize topic-specific sites to participate.

@AramZS

AramZS commented Feb 15, 2022

There is a schema.org property for articles that could be used for this - articleSection - on https://schema.org/Article

However, I imagine that it would be VERY easy for malicious publishers to create fraudulent, generated-content site sections around this.
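
For reference, a minimal sketch of reading `articleSection` from a page's JSON-LD markup. It assumes the markup is a single JSON object whose `@type` is exactly `Article`; real pages may use subtypes, arrays, or `@graph`, which this sketch does not handle. `article_section` is a hypothetical helper name, and nothing here reflects how a browser would actually obtain the value.

```python
import json
from typing import Optional


def article_section(json_ld: str) -> Optional[str]:
    """Return the articleSection string from a single JSON-LD Article object,
    or None if it is missing, not a string, or the markup is not usable."""
    try:
        data = json.loads(json_ld)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    if not isinstance(data, dict) or data.get("@type") != "Article":
        return None
    section = data.get("articleSection")
    return section if isinstance(section, str) else None
```

The returned value would serve only as a grouping identifier; any concern about misleading section names has to be addressed by classifying the page content itself.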

@dmarti
Contributor Author

dmarti commented Feb 15, 2022

Yes, the section name would not be used at all for classification. It's just an identifier. If I name a section of my blog "luxury SUV test drives Mountain View" but it's all about my cat, the ML would classify those pages as "cat".

@AramZS

AramZS commented Feb 15, 2022

I think that definitely helps conceptually prevent some misrepresentation, but as we've seen in the wild, it is very easy to run a set of articles through a scrambler, spit out something that is almost the same as a ripped-off article, and place it into a page with the associated tagging. I don't think it really solves the problem... though arguably it's very easy for a fraudster to spin up their own domain for each topic as well, so I'm not sure there's really a solution that has to do with sections... more just that this is a general concern that the ML generating the topics will have to handle some other way.

@dmarti
Contributor Author

dmarti commented Feb 15, 2022

@AramZS Yes, the scrambled or plagiarized article problem is general and not really tied to sections or even topics. (Kind of like brand safety -- it shows that none of this stuff works very well if adtech firms do a bad job of checking which sites they're willing to work with)
