
Talk:Spam blacklist/Archives/2008-06


Proposed additions

This section is for completed requests that a website be blacklisted.

christianthomaskohl pseudoscience vanispamcruftisement





A couple of cross-wiki spammers; there's a whole bunch of one-wiki spammers at the WT:WPSPAM link.






Full report at w:WT:WPSPAM#christianthomaskohl (permanent link), see also User:COIBot/LinkReports/ctkohl.googlepages.com and User:COIBot/LinkReports/christianthomaskohl.googlepages.com. Pseudoscience, blatant self-promotion. 143.238.211.63 11:34, 26 May 2008 (UTC)

Agree with the proposed addition. Adambro 12:30, 26 May 2008 (UTC)
Thanks - Added  – Mike.lifeguard | @en.wb 20:51, 27 May 2008 (UTC)

Turkish real estate cross-wiki spam



Works via dynamic Turkish IP addresses. Per wiki version, the matching language version of the site is added, e.g. on nl-wiki it is http: // www.antalyahomes.com/default.asp?lang=nl. Kind regards, MoiraMoira 07:53, 27 May 2008 (UTC)

Agreed, cross-wiki & active. Cleaned & Added, thanks --Herby talk thyme 08:10, 27 May 2008 (UTC)

Multiple sites

Cross-wiki spam from 221.218.131.237 (xwiki-contribs, xwiki-date (alt), STIP info, WHOIS, robtex, gblock, glist, abuselog, bullseye), adding multiple spam links to sites that contain the word "sex".



















Best regards, --birdy geimfyglið (:> )=| 09:02, 30 May 2008 (UTC)

OK, that is enough, also summarised in User:COIBot/XWiki/fuzokudx.com. Is there anything useful here? --Dirk Beetstra T C (en: U, T) 09:08, 30 May 2008 (UTC)
Added. --Dirk Beetstra T C (en: U, T) 09:16, 30 May 2008 (UTC)
Another IP:


--Dirk Beetstra T C (en: U, T) 10:53, 30 May 2008 (UTC)

folkmusicindia.com



Spammers




Plus more. See w:WT:WPSPAM#folkmusicindia.com (permanent link). MER-C 13:39, 30 May 2008 (UTC)

Added. --Dirk Beetstra T C (en: U, T) 13:43, 30 May 2008 (UTC)
Thanks.--Hu12 10:26, 31 May 2008 (UTC)

giacomo.lorenzoni.name



See also;

Thanks, --Hu12 10:26, 31 May 2008 (UTC)
Indeed. Very cross wiki & usually removed from projects so not required as far as I can see. Added, thanks --Herby talk thyme 10:31, 31 May 2008 (UTC)
Added more, and it has been cleared from all projects, thanks --Hu12 11:22, 31 May 2008 (UTC)

viartis.net

I thought this troublesome domain had been deep-blacklisted long ago; apparently not:

Background

Account (this time):



--A. B. (talk) 02:51, 1 June 2008 (UTC)

A. B., you can next time just add this to the report in 'User:COIBot/XWiki/domain' (User:COIBot/XWiki/viartis.net in this case). There is a comment in there; all discussion behind that comment is not deleted, but retained if the report is regenerated.
In all cases, Added. --Dirk Beetstra T C (en: U, T) 19:04, 1 June 2008 (UTC)

mybuys.com

I spotted a block & page deletions here (here). There was one link in the top 10 search on en wp which I have removed. I do not see this as a valuable link for the project. It is more a case of the potential for spam than actual spam, so other views are welcome.




Thanks --Herby talk thyme 07:27, 2 June 2008 (UTC)

The link has not been added for as far back as the linkwatchers' database goes (and is complete). I think we should not blacklist things when there is no abuse, but there is nothing wrong with monitoring it. --Dirk Beetstra T C (en: U, T) 13:04, 2 June 2008 (UTC)
Agreed - it is not useful, but we won't list it until it is abused. I will (try to) add it to the bots if not already done.  – Mike.lifeguard | @en.wb 01:15, 3 June 2008 (UTC)

More satanismresource spam

satanismresource.fortunecity.com redirects to blacklisted domain geocities.com/satanismresource.

See Talk:Spam blacklist/Archives/2008/05#geocities.com/satanismresource





--A. B. (talk) 03:07, 3 June 2008 (UTC)


Added --A. B. (talk) 03:26, 3 June 2008 (UTC)


HedgeLender LLC spam

Spam domains




Related domains


















Accounts










Reference

--A. B. (talk) 04:52, 3 June 2008 (UTC)

I agree this is cross-wiki spam and warrants listing. I think in this case it may be worthwhile to add the related domains too. Or just the ones spammed?  – Mike.lifeguard | @en.wb 14:28, 3 June 2008 (UTC)
I didn't have time to blacklist myself and I'm in meetings all day today. Yes, I think they should all be blacklisted. --A. B. (talk) 17:17, 3 June 2008 (UTC)
Added  – Mike.lifeguard | @en.wb 01:49, 4 June 2008 (UTC)

Tangram software seller from Spain



Has been spamming all Tangram pages wiki-wide since April to sell his software. Uses dynamic Spanish IP addresses - see here for the IP numbers used up until tonight. TIA and kind regards, MoiraMoira 18:28, 3 June 2008 (UTC)

Agreed and blacklisted. --Erwin(85) 20:18, 3 June 2008 (UTC)

muzikfakultesi.com-related spam

Spam domains











Related domains











Google Adsense: 6158286478265594
There appear to be many more related domains


Spam accounts







Reference

--A. B. (talk) 04:01, 5 June 2008 (UTC)

Added --A. B. (talk) 04:25, 5 June 2008 (UTC)

Gallomedia spam

Spam domain



Related domains





















Accounts



Reference

--A. B. (talk) 04:11, 5 June 2008 (UTC)

Added --A. B. (talk) 04:26, 5 June 2008 (UTC)

supermodels.nl



Although this site has been blacklisted, there are still plenty of links on Wikipedia and it would be great if the removal could be done by a bot. According to Finjan Secure Browsing ([see this screenshot]), AVG and several board threads, this site is infested with malware and badware. Besides that, the site is inaccessible at times, and when it does respond it keeps loading and loading. Robomod 20:58, 7 June 2008 (UTC)

This site is not blacklisted at meta, but at enwiki. Should we consider adding it here? I will take a look at removing some links now.  – Mike.lifeguard | @en.wb 21:10, 7 June 2008 (UTC)

Agreed in a sense - however, if it contains malware then prevention is better than cure - we have listed on that basis in the past. Added for now - it can always be removed when the problem is clarified/dealt with - cheers --Herby talk thyme 06:52, 9 June 2008 (UTC)

cccb.org



Spammers


Massive spam page creation and linkspamming across several projects from a confessed paid editor. See w:WT:WPSPAM#spam.cccb.org (permanent link).

I'd like a steward to deal with this request in order to unify and lock the account. MER-C 09:49, 9 June 2008 (UTC)

I Added this to the blacklist. --Dirk Beetstra T C (en: U, T) 12:11, 9 June 2008 (UTC)

maskmelin.livejournal.com

Spammed on en and pt.wikipedias:

Spam domain


Spam accounts








Reference

--A. B. (talk) 01:56, 10 June 2008 (UTC)

Added --A. B. (talk) 02:01, 10 June 2008 (UTC)

stadiumzone.net

Spam domain


Spam accounts

Spammer has been repeatedly warned and eventually blocked on en.wikipedia, but to no avail:



Reference

I have not had time to check for other domains this spammer may have added. --A. B. (talk) 05:24, 10 June 2008 (UTC)

Added --A. B. (talk) 05:28, 10 June 2008 (UTC)

crosswiki commercial learning site via french ip addie



MoiraMoira 07:45, 10 June 2008 (UTC)

The site was spammed by



The domain has been Added. Thanks for the heads up. Could you please include the address next time though? --Erwin(85) 08:43, 10 June 2008 (UTC)

anontalk.com

Too many to list, and they don't link the site, so it would be best if the entire string "anontalk" or "www.'''AnonTalk.com'''" that's usually used could be blocked from being entered into any wp article. I don't know how much this has been spammed, but I'm active on svwp and we constantly have to block new proxies spamming this. /Grillo 15:46, 11 June 2008 (UTC)



Could you give some examples? We don't need them all; we can probably poll the databases to find all other abuse. Thanks! --Dirk Beetstra T C (en: U, T) 19:07, 11 June 2008 (UTC)
For reference:
--Jorunn 21:01, 11 June 2008 (UTC)
Spam on a single project should be dealt with at the local blacklist.
This blacklist is used by more than just our 700+ Wikimedia Foundation wikis. All 3000+ Wikia wikis plus a substantial percentage of the 25,000+ unrelated wikis that run on our MediaWiki software have chosen to incorporate this blacklist in their own spam filtering. Each wiki has a local blacklist which affects that project only.
This request is Declined. Please use w:sv:MediaWiki:Spam-blacklist. If your administrators need help using the blacklist, they may see mw:Extension:SpamBlacklist or visit #wikimedia-admin (invite required).  – Mike.lifeguard | @en.wb 02:15, 12 June 2008 (UTC)
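(As an illustration of the point above: a third-party MediaWiki installation typically pulls this shared list in through the SpamBlacklist extension's $wgSpamBlacklistFiles setting, roughly as sketched below. The exact raw-page URL is an assumption here; check mw:Extension:SpamBlacklist for the current documented value.)

	$wgSpamBlacklistFiles = array(
		// the shared blacklist maintained here on Meta, fetched as raw wikitext (URL assumed)
		"http://meta.wikimedia.org/w/index.php?title=Spam_blacklist&action=raw&sb_ver=1",
		// the wiki's own local blacklist page
		"DB: mywikidb Spam_blacklist"
	);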
See for instance en:Special:Contributions/202.44.32.9 (recent edits). 81.232.56.250 08:52, 12 June 2008 (UTC)

It is cross wiki - I've seen it in a number of places (& it is on the en wp BL I think) but the links I've seen are never clickable so blacklisting will not have any effect I'm afraid (as far as I know) --Herby talk thyme 09:02, 12 June 2008 (UTC)

No, the blacklist affects only "real" links.  – Mike.lifeguard | @en.wb 20:24, 12 June 2008 (UTC)
I'll have a nice life reverting proxies from Anontalk then. Thanks for your time. /Grillo 18:40, 13 June 2008 (UTC)

crosswiki commercial holiday homes rental site



The site was spammed by



Caught in the act today, cross wiki active since 2007. Kind regards, MoiraMoira 09:17, 12 June 2008 (UTC)

Nice catch, thanks. Added, but it maybe needs clearing (I've done en wp but have no more time now) --Herby talk thyme 09:33, 12 June 2008 (UTC)

bidmonfa.com



Spammers


















  • plus more, see the WikiProject Spam item

See w:WT:WPSPAM#spam.bidmonfa.com (permanent link). I've only cleaned the links that were spammed, for the rest see the WikiProject Spam item. MER-C 11:27, 12 June 2008 (UTC)

Thanks & Added. Will take a look at removing excess links shortly.  – Mike.lifeguard | @en.wb 21:52, 12 June 2008 (UTC)


Crosswiki Tucson, Arizona spammer in french





See contribs via toolserver, spammed 26 times on 18 wikiprojects. EdBever 06:34, 14 June 2008 (UTC)

Yes, good catch - agreed & Added, thanks --Herby talk thyme 07:04, 14 June 2008 (UTC)

crosswiki commercial tourist site



The site was spammed by



Caught in the act today, done via university server. Kind regards, MoiraMoira 07:19, 16 June 2008 (UTC)

Four projects have been spammed, so I'm reluctant to immediately blacklist it. So I haven't. See also User:COIBot/XWiki/burasibalikesir.net. --Erwin(85) 09:54, 16 June 2008 (UTC)

jehovah-shamah.wikidot.com



is being spammed by



There's no bot report, so I don't want to blacklist it myself without being able to link somewhere in the log. So can someone else please check his global contributions and consider blacklisting? I've reverted most or all of his edits. --Erwin(85) 11:04, 16 June 2008 (UTC)

Per Luxo, the user already deserves quite a block. I should say, he does a lot of 'advertising' for his link on talk pages. Added --Dirk Beetstra T C (en: U, T) 11:12, 16 June 2008 (UTC)

is.gd



URL redirector. Ohnoitsjamie 18:14, 16 June 2008 (UTC)
Thanks, Added  – Mike.lifeguard | @en.wb 19:06, 16 June 2008 (UTC)

crosswiki commercial broadcast site





The site was spammed by



Caught in the act today, various cyclist tour sub pages placed everywhere. Kind regards, MoiraMoira 07:19, 16 June 2008 (UTC)

Actually it was mostly unreverted. That said, I've now removed it :) Not listing for now but one to keep an eye on & when the bot is active again I'll poke it with this one I think. Thanks --Herby talk thyme 07:45, 19 June 2008 (UTC)
Meant to come back on this one but have been rather busy. I had an email from the site owner - I think I have convinced them that such cross wiki link placement is not a good idea. Can be archived soon - if it comes back it can be listed. Cheers --Herby talk thyme 11:42, 24 June 2008 (UTC)

Thai Apple iPhone spam





Cross-wiki spam, caught in the act MoiraMoira 07:27, 24 June 2008 (UTC)

Same here, I've already Added it. Thanks for the heads up though. --Erwin(85) 07:31, 24 June 2008 (UTC)

admiroutes.asso.fr/larevue/2007/83/kohl.htm

More Christian Thomas Kohl nonsense. See w:WT:WPSPAM#christianthomaskohl (permanent link). Everything else in the WPSPAM item is already blacklisted here. MER-C 04:17, 20 June 2008 (UTC)

Thanks - I'd just spotted it on en wp & was coming here to do it! Added, I guess there will be more --Herby talk thyme 06:54, 20 June 2008 (UTC)

Proposed removals

This section is for archiving proposals that a website be unlisted.

hnl-statistika.com

Please remove it from the blacklist [1] because it is one of the main sources of Croatian soccer statistics. From ru-wiki, --Munroe 12:11, 28 May 2008 (UTC)

I considered that when I added it; I will go ahead and remove it. Removed --Dirk Beetstra T C (en: U, T) 12:33, 28 May 2008 (UTC)
Thanks! --Munroe 13:00, 28 May 2008 (UTC)

natureperu.com

The link to the photos of coca tea was the only existing page showing filtering products extracted from the coca leaf on the Coca page. A problem caused by some anonymous user's ignorance of Wikipedia's procedure should not harm the publication of information that may interest many people. -- User:Jbricenol (talk) 27 May 2008 20:57 (UTC)

Please see User:COIBot/XWiki/natureperu.com and Talk:Spam_blacklist/Archives/2008/05#Natureperu.com.  – Mike.lifeguard | @en.wb 00:11, 28 May 2008 (UTC)
The content of the pages is not enhanced by adding links to pictures of packs of tea, especially not commercial packs of tea (I quote: "Coca Tea Manufactured by Enaco S.A. and Sold by NaturePeru.com"). You clearly clicked your way through the interwiki links on coca, and added only this link, and I don't know which language 'et.wikipedia' is, but you decided not to translate but just copy and paste the section 'photos' in English.
One of these images could have been nice on the article 'tea pack', 'tea packaging' or something similar (if those articles exist), and maybe on 'coca tea', but the links are not adding anything. Not done. --Dirk Beetstra T C (en: U, T) 09:13, 28 May 2008 (UTC)


Upon investigating this request, I found that there were additional, related domains not blacklisted at the time:


  • This domain has also been spammed




Additional accounts:








--A. B. (talk) 02:32, 1 June 2008 (UTC)


Additional 3 domains Added --A. B. (talk) 01:53, 3 June 2008 (UTC)

outrate.net

My site Outrate.net was blacklisted late in 2006 due to what was perceived as "linkspamming", as I'd added links to a number of Wikipedia pages rapidly, without realising this was against policy.

The blacklist was upheld on further complaint that the site had excessive adult-oriented advertisements.

The advertisements have been completely removed from the site, which is a content-rich site filled with hundreds of film reviews written specifically for the site, interviews with celebrities conducted by the editor of the site and published exclusively there, a short film festival hosted on the site, etc.

Pages from Outrate.net that belong as external links on Wikipedia include interviews with figures like Billy Hayes, author, and so on, and these pages are what I'd like to add back in.

Is a removal of the blacklist on Outrate.net at all possible?

Mark

Unlisting is possible, but we generally do not remove links when requested by someone involved in the site. May I suggest that you contact a/some wikiproject(s) on some wikipedia (or other places where the use of your links could be discussed, for en wikipedia I would think about en:Wikipedia:WikiProject Films, I don't know how or what other wikis do), see if they deem the link useful, and then report back here (or ask an established user to request removal). --Dirk Beetstra T C (en: U, T) 13:02, 2 June 2008 (UTC)

thank you for your reply. I don't know anyone who's a wikipedia user, or quite how to go about your recommendations above. Is there another option here? The preceding unsigned comment was added by 202.3.37.98 (talk • contribs) 13:47, 2 Jun 2008 (UTC)

There is a sense here in which you explain your own problem. You don't really know how Wikipedias work. If you have valuable knowledge on a subject you would be best getting involved on the pages relating to that subject on a language Wikipedia that you are fairly fluent in. Once you have become involved you will know other contributors & they will know you. It will then be possible for them to decide whether your site provides something that is of sufficient interest to warrant external links.
If you are unable to do that then I regret your site will remain on this list because, as said above, we do not remove sites at the request of those involved with them, sorry --Herby talk thyme 07:09, 3 June 2008 (UTC)

Nothing further heard so closed as  Declined --Herby talk thyme 07:54, 7 June 2008 (UTC)

podiatryworldwide.com

Hi,

My name is Pierre,

I made a neutral, non-profit, scientific, institutional website.

www.podiatryworldwide.com Global Podiatry Worldwide Directory

I made this website to provide what was missing on the internet in the podiatry speciality.

(In order to develop knowledge, this website simply tries to rationally index, organize and reference all of the major scientific Internet sites relating to podiatry by continent, country and specialty. Its intent is to develop contacts and links, to exchange information and to organize and standardize knowledge between all of the world's professionals in this era of globalization.)

I tried to begin making external links through Wikipedia by means of about 15 keywords in many languages relating to podiatry, such as podiatry, orthopedic, foot ... but I was blacklisted.

I would like to know what would be possible to do. I would like to know what is acceptable or not. I would like to know what I can do to get indexed on Wikipedia.

Because if I don't do this on Wikipedia, I ask myself where else I can do it but on Wikipedia.

Thank you

Yes, well our projects are not link directories. You may find DMOZ useful - it is a directory of links. You may find them at http://www.dmoz.org/.  – Mike.lifeguard | @en.wb 00:12, 5 June 2008 (UTC)
I should also point out that the domain is not blacklisted currently.  – Mike.lifeguard | @en.wb 00:14, 5 June 2008 (UTC)

drupalmodules.com

See User:COIBot/XWiki/drupalmodules.com

Removed --Dirk Beetstra T C (en: U, T) 09:35, 3 June 2008 (UTC)

encyclopediadramatica\.(com|net|org)

Articles exist on

The link is on the blacklist because of on-wiki politics at en, not because of spam. Ignoring the problem of using a spam blacklist for wiki-politics, the politics are over. The en ArbCom has said it is up to the community, and the community (on the talk page of the now existing article) has overwhelmingly said it should be linked to as standard practice. It should be removed from the blacklist and not returned. SchmuckyTheCat

The link was blacklisted by ArbCom in order to protect the project and its usership from harm, abuse and attacks. Attempting to dismiss it as "wiki-politics" is both dangerous and irresponsible. The "community" of which you speak is an isolated handful of editors who have pursued an interest in this particular topic, and in no way reflects the usership of the 700+ Wikimedia Foundation projects. Having an article on the English Wikipedia does not give "carte blanche" for its wholesale removal for Foundation-wide indiscriminate linking. It fails almost every criterion for delisting consideration, including but not limited to;
http://www.encyclopediadramatica.com/Main_Page has been whitelisted by myself on en. However, the ArbCom ruling allowing for this home page linking is limited to within that article only. ArbCom has in no way sanctioned removal from the global blacklist. Perhaps requesting whitelisting of http://www.encyclopediadramatica.com/Main_Page on each of the individual wikis may be appropriate, however its wholesale removal is not. Each wiki using MediaWiki software has a local whitelist. Only administrators of that wiki can modify their whitelist page. If you want a link added to the whitelist of a particular wiki, you should post a request on that wiki's talk page --Hu12 04:06, 6 June 2008 (UTC)
It should really be removed, because English ArbCom only has authority over the English Wikipedia, not over all of the 700+ wikis. If it is a problem on a single wiki, then they should use the local blacklist. Monobi (talk) 21:25, 6 June 2008 (UTC)
A poor example, however here's a typical thing, regarding en Admin en:User:LaraLove;
  • Cut and paste
Use of shock or attack sites on any of the Wikimedia Foundation's wikis has always been unacceptable regardless.--Hu12 22:19, 6 June 2008 (UTC)
Actually, after thinking about it some more, this URL should probably be blacklisted, because I can imagine it being spammed on smaller wikis that don't have a local blacklist set up. If wikis want to link to it, they can deal with it locally. Also, wouldn't the regex have to be \.(com|net|org)* ? Meh, I dunno. Monobi (talk) 05:30, 7 June 2008 (UTC)

This one has been extensively debated in the past. For the reasons above and the past discussions this is  Declined. Local whitelisting is perfectly possible & easy if the local community are happy with it. Thanks --Herby talk thyme 06:53, 7 June 2008 (UTC)

gemisimo.com

Our site was blacklisted almost a year ago and has been blacklisted ever since. I read extensively the correspondence in the discussions here: [8], [9]. I also read External_links#Links_to_be_considered and I don't think we fall into this category.

Moreover, we improved a lot throughout the year and now our site offers a large variety of professional articles relating to topics within the wiki. We would like to ask you to consider the removal of the site from the blacklist on the basis of our improvements and relevance to the Wikipedia project in relation to diamonds. For example, diamonds.gemisimo.com/en/Diamond-Project/Diamond-Basics/Diamond-Shapes.html has useful info that could be added to the wiki after approval by Wikipedians in the discussion for a specific article. This blacklist is really hurting our efforts and money investments into the site; we have been penalized and accept it, but on the same note we believe we paid the price for one person's foolish mistake that obviously will never happen again. Thanks for the consideration --79.177.9.237 13:18, 8 June 2008 (UTC)

Sorry, the fact that you seem mostly concerned about how "This blacklist is really hurting our efforts and money investments into the site" seems to make the point sufficiently.
Typically, we do not remove domains from the spam blacklist in response to site-owners' requests. Instead, we de-blacklist sites when trusted, high-volume editors request the use of blacklisted links because of their value in support of our projects. If such an editor asks to use your links, I'm sure the request will be carefully considered and your links may well be removed.
Until such time, this request is Declined. – Mike.lifeguard | @en.wb 19:13, 8 June 2008 (UTC)
I respect your opinion and I understand your position and with respect to what you said I did show that the site is useful for wiki and could very easily be used in wiki articles related to diamonds and actually it would be possible to make some new articles using some verifiable information on the site. Please let me know what I can do to help wiki and contribute so you can see that I sincerely mean what I say 77.127.136.85 00:39, 9 June 2008 (UTC)
My suggestion would be, try to contact some wikiprojects (on the English wikipedia, you would have to go to en:Wikipedia:WikiProject, other language wikipedia have similar systems, but they are probably called differently). There try to find an appropriate project, and discuss your information there. If there is a general consensus there that your site is indeed of interest, then one of the regular/established users (one that has no connection with the external site) there can come here and request unlisting and the regex will be removed promptly. I hope this explains. --Dirk Beetstra T C (en: U, T) 09:15, 9 June 2008 (UTC)

alizee-latino.com

Hello. I have a problem and I wish you could help me. First I'll explain the situation: my website has been added to the blacklist today, as it says in the Global Spam Blacklist

\balizee-latino\.com\b

...by you as it says in the log of the Blacklist: \balizee-latino\.com\b # Erwin # see User:COIBot/XWiki/alizee-latino.com

... and you say in your report: BOT generated XWiki report. More than 66% of the cross-wiki placing and addition of this link has been performed by one editor, and the link has been added to 3 or more wikipedia.

As you can see this is a BOT auto-generated report. Now let me explain why this website shouldn't be on the blacklist: it was me, myself (username wikimeta) who added this link to all the Alizée pages (Alizée is a French singer). Why did I add them to all the Wikipedia languages? Because we are an official website that Sony BMG Mexico knows about, and Yahoo Mexico too. So our site is an official fan club in Latin America. I did this and I didn't think that some kind of BOT was going to block me.

We really need our link on all the possible Alizée pages because we're the official site and forum. We don't have porn, piracy, viruses, spyware, etc., so it's not fair that we are on the blacklist. The Alizée concert in Mexico will be next week and we have all the official info, so it's really important that our link is available in all the possible languages.

Please understand and erase our site from the blacklist. Please answer me as soon as possible. Remember, we're an official Alizée fan club in Latin America with Sony BMG Mexico and Yahoo Mexico support.

This is urgent so I'll appreciate a quick answer. THANKS!!!!!!!!!!!!!!!!!!!!!!!!!! The preceding unsigned comment was added by Wikimeta (talk • contribs) 19:04, 11 June 2008.

I'm not saying that the site itself is unwanted, but the excessive linking is. Why should we link to all official fan clubs? It might be interesting for projects where a significant number of users come from Latin America, but in any case not for other languages. Please note that Wikipedia is not a link farm. More information can be found at w:en:Wikipedia:External_links. At the moment I see no reason to remove this site from the blacklist. It appears you'll just re-add the link to all projects. --Erwin(85) 11:07, 12 June 2008 (UTC)
I endorse blacklisting. Your main reason to add the link was to advertise your site, if you were here to improve the wikipedia, you would have added contents to the articles. We are writing an encyclopedia, here, not a linkfarm. --131.251.123.246 11:12, 12 June 2008 (UTC)
Generally speaking "fansites" of any sort are to be avoided - I always remove them on that basis. Wikipedia is an encyclopaedia not a place for fan club listings. Thanks --Herby talk thyme 11:18, 12 June 2008 (UTC)


Thanks for the answer. However I think you don't quite understand what I'm saying. This is very important: all the official information for the concerts in Mexico will be on our site ONLY, and every fan has the right to know all this info. OK, it was wrong adding the link to all languages, I didn't see it was wrong; I've learned a lesson. So I just need the link in the Spanish and English sections, that's all. If you always remove the fan sites, why remove just mine, the official one, and not the unofficial ones?? It's just not fair. However I know I have committed a fault against the External Links rules: "# Links mainly intended to promote a website." OK, I just thought "Hey, it would be a nice idea posting it in every language". I will never ever do this again, sorry.

If we talk as we did now, I'm sure we can understand each other. As I read the rules, I've found this one too, please take a look: "What should be linked

  1. Articles about any organization, person, web site, or other entity should link to the official site if any."

All the information will be available on our site only, and earlier than in any other place; that is the importance of this.

Also I've read this too, look:

"Non-English language content

Links to English language content are strongly preferred in the English-language Wikipedia. It may be appropriate to have a link to a non-English-language site, such as when an official site is unavailable in English; or when the link is to the subject's text in its original language; or when the site contains visual aids such as maps, diagrams, or tables."

As you know, Mexico is near the United States... many fans from this country are coming to our concert and they get the main info from our site. OK, OK, maybe adding my link to the Greek or Japanese sections was wrong, but I'm trying to explain why our link should be in the Spanish and English sections.

I've improved Wiki with translations that haven't been posted by myself several times. I won't cause any further problems in the future, so please reconsider your decision, I'm begging you people. The concert is next Tuesday. I'll add the link in the Spanish and English sections only... OK? I won't break any rules if I just add the link in those 2 languages. So, what do you think???

Please... Please understand the importance of this T_T

--Wikimeta 17:14, 12 June 2008 (UTC)

Hmm, it might be best to add it to a local whitelist. Monobi (talk) 17:30, 12 June 2008 (UTC)

Ohh, almost forgot!! I can hear you saying "Hey, and who can say if this website is really official?" Eeeasy. Please look at this link: http://sonybmg.mdenlinea.com/alizee/ As you can see it's the official site of Sony BMG Mexico... Now please click on the EXTRAS tab... There you can see our link with the other official URLs. I think this is good proof. So, I've learned my lesson and I think you people are guys who can understand a foolish and innocent mistake of someone like me. I won't add multiple links as I did, sorry, OK?

So I'll wait for your answer. Thanks and please reconsider ; ) --Wikimeta 17:38, 12 June 2008 (UTC)

It is the behaviour of those who place links as much as the site content to me. Thanks --Herby talk thyme 18:12, 12 June 2008 (UTC)
Indeed, Herby. It is your behaviour. You decided that only the weblink was important. I just did a search on some random-language wiki pages on Alizée, and none even mentioned that she is giving a concert. Maybe you did not understand: we are writing an encyclopedia here, and encyclopedias are based on content and information, not on external links. There is no need to link to your site if the information is not notable enough to be mentioned in the content (and of all the other fans I suppose Alizée has, some of whom I expect to be established Wikipedia editors, none bothered to enter that information). So again:  Declined. --Dirk Beetstra T C (en: U, T) 10:15, 13 June 2008 (UTC)


O.o! Impressive, I just can't believe this! But this is your world and I shall follow your rules. If your complaint is that the Wikipedia project doesn't show anything about the concert, that's very simple to resolve: we can write down all the information here as well, that's not a problem. Exclusive information that nobody else knows about; we know everything before anyone. As I say, I shall respect your world's rules, but in fact it's not fair banning someone globally for a first mistake.

Some final questions. Just 2 more things: 1) How long will our website be on the blacklist? 2) As you said here in another post: Typically, we do not remove domains from the spam blacklist in response to site-owners' requests. Instead, we de-blacklist sites when trusted, high-volume editors request the use of blacklisted links because of their value in support of our projects. If such an editor asks to use your links, I'm sure the request will be carefully considered and your links may well be removed.

Those editors are Sony BMG Mexico and Yahoo Mexico... Tell me, what do you need from them?? A letter? An e-mail? A video?? What can I do to prove to you I'm not joking?

I just need an answer for those 2 questions. Thank you.

--Wikimeta 15:58, 13 June 2008 (UTC)

It will be here until the 'trusted, high-volume editors' request such a removal. With 'trusted, high-volume editors' we mean the wikipedia editors who want to use the link on wikipedia, not external organisations. Hope this explains. --Dirk Beetstra T C (en: U, T) 16:01, 13 June 2008 (UTC)

Troubleshooting and problems

This section is for archiving Troubleshooting and problems.

Discussion

This section is for archiving Discussions.

bugzilla:1505

For info & comments: Since this bug was fixed in rev:34769, we can possibly be more aggressive in blacklisting domains. The new behaviour is that the page can be saved if the URL was present in the previous revision - so the spam blacklist will stop only new spam - past spam must be handled separately, as it will stay there until removed (and you won't be stopped from saving the page). – Mike.lifeguard | @en.wb 03:50, 14 May 2008 (UTC)
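(A rough sketch of the behaviour just described - assumed names, not the extension's actual code: only URLs that are new relative to the previous revision are checked against the blacklist.)

	function editIsBlocked( array $oldLinks, array $newLinks, array $blacklistRegexes ) {
		$addedLinks = array_diff( $newLinks, $oldLinks ); // links new in this revision
		foreach ( $addedLinks as $url ) {
			foreach ( $blacklistRegexes as $regex ) {
				if ( preg_match( $regex, $url ) ) {
					return true; // only newly added matches block the save
				}
			}
		}
		return false; // blacklisted links already present in the previous revision do not prevent saving
	}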

Thanks for the heads up. It's also useful for whitelisting. I whitelisted a domain on nlwiki yesterday because it was used as a source. That kind of forced me to whitelist even though I agreed with blacklisting here. I guess in such a case you don't necessarily have to whitelist now. --Erwin(85) 09:48, 14 May 2008 (UTC)
Would this also mean that the bots can put a working link in their reports, so the reports can be found via a Special:LinkSearch (easy, just adapt the LinkSummary template)? --Dirk Beetstra T C (en: U, T) 14:37, 16 May 2008 (UTC)
I guess so. The links are reported before blacklisting, so in case the address is blacklisted because of the report we can still comment. --Erwin(85) 14:49, 16 May 2008 (UTC)
Let's try... or will this make our reports increase in Google-findability?? --Dirk Beetstra T C (en: U, T) 14:53, 16 May 2008 (UTC)
The fact that it's a link as opposed to plaintext is irrelevant as we have nofollow on WMF sites. Also, it looks like we will be excluding bot reports from indexing altogether. – Mike.lifeguard | @en.wb 17:18, 16 May 2008 (UTC)

Closing old bot reports

I've been checking some old reports in Category:Open XWiki reports and most link additions were a few weeks or months old. I haven't come across a url that I think should be blacklisted and so I'm wondering if anyone has objections to closing all reports with less than say 3 edits and over a month old. I'm not suggesting to use a bot, but I am suggesting to close them without checking diffs. --Erwin(85) 16:06, 21 May 2008 (UTC)

I would say, do as you see fit. The reports are there, they may contain a couple which are really bad and we can prevent future troubles, the rest go. If it reoccurs, we will probably see it, and if it is really bad, add it now. Don't waste too much time on it, and don't worry if you close a bad one without adding .. --Dirk Beetstra T C (en: U, T) 16:15, 21 May 2008 (UTC)
Actually, I was considering using a bot to mass-close them. They are so stale as to be largely useless. The bot will bring back anything needing further attention. I suppose that would run the chance of missing something, but I am not prepared to spend enough time to go through them thoroughly. If you are, feel free.  – Mike.lifeguard | @en.wb 19:18, 21 May 2008 (UTC)
Using a bot would be fine by me as long as it's used for old cases. --Erwin(85) 19:59, 21 May 2008 (UTC)
Perform a close-run on everything with less than 5 links, and see what is left over. I am not worrying about them at all, but we may be able to blacklist some real rubbish based on them, so we don't have to do that later and have extra work. But if they all just get closed and then we see them again... they are not in the way; COIBot ignores the SpamReportBot ones anyway. --Dirk Beetstra T C (en: U, T) 09:47, 22 May 2008 (UTC)
I just set up a bot to close the reports. Using the toolserver I created a list of 123 reports with no edits in the last month. From those reports there were 40 with less than 5 links. My bot is closing them now. --Erwin(85) 13:14, 22 May 2008 (UTC)

I've written a bot to list all open reports at User:Erwin/Spamreports. If anyone's interested I could run it as a cron job. --Erwin(85) 18:17, 28 May 2008 (UTC)

Excluding our work from search engines

This is a bigger problem for enwiki than for us, but still... I'd like to ask that subpages of this page be excluded from indexing via robots.txt so we do not receive complaints about "You're publicly accusing us of spamming!" and the like. These normally end up in OTRS, where it is a waste of volunteers' time and energy. The MediaWiki search function is now good enough that we can use it to search this site for a domain rather than relying on a google search of meta (ie in {{LinkSummary}} etc). As well, we'll include the subpages for COIBot and LinkReportBot reports. – Mike.lifeguard | @en.wb 20:05, 10 May 2008 (UTC)

I have made a bug for this: bugzilla:14076. – Mike.lifeguard | @en.wb 20:10, 10 May 2008 (UTC)
Good idea. There's no need for these pages to be indexed. --Erwin(85) 08:06, 12 May 2008 (UTC)
Mike, I strongly agree with excluding crawlers from our bot pages.
I very much disagree, however, with excluding crawlers from this page and its archives. In many cases, seeing their name in a Google search is the first time at least half our hard-core spammers finally take us seriously. Since they usually have other domains we're unaware of, this deters further spam.
If domain-owners feel wronged about entries on this page, overworked OTRS volunteers should feel free to direct them to the removals section here and we can investigate. In my experience, many Wikipedia admins and editors don't have enough experience with spam to know how to investigate removal requests and separate the sheep from the goats.
If there's been a false report, we can move the entries from our crawlable talk archives to a non-crawlable subpage (call it "false positives").
I'll also note that I think we've been getting more false positives blacklisted since we got these bot reports. I continue to feel strongly that we must be very conservative in blacklisting based on bot reports. If a site's been spammed, but it looks useful, wait until some project complains or blacklists it. Even if a site's been spammed and doesn't look useful, if the spammer hasn't gotten enough warnings then we shouldn't blacklist it unless we know he fully understands our rules -- that or we get a complaint from one of our projects. Perhaps we should have our bots issue multilingual warnings in these cases.
As for the spammer that's truly spammed us in spite of 4 or more warnings, I don't care if he likes being reported as a spammer or not. I've spent almost two years dealing with this subset of spammers and they're going to be unhappy with us no matter what we do until they can get their links in. --A. B. (talk) 13:39, 12 May 2008 (UTC)
I agree with you that the bot reports probably have a threshold that is too low - perhaps that can be changed. Until such time, they need to be handled carefully.
The problem I am talking about is not false positives. Those are very straightforwardly dealt with. The real problem is not with people emailing us to ask to get domains de-listed, but rather with people emailing us demanding that we stop "libeling" them. What they want is for us to remove all references to their domain so their domain doesn't appear in search results next to the word "spam". Well, I'm not prepared to start blanking parts of the archives to make them stop whining - we need these reports for site maintenance. Instead we can have our cake and eat it too: don't allow the pages to be indexed, and keep the reports as-is. I'm not sure I see how having these pages indexed deters spammers. What I do see is lots of wasted time dealing with frivolous requests, and a way to fix that.
Just a reminder to folks that we should discuss this here, not in bugzilla. I am closing the bug as "later" which I thought I had done earlier. Bugzilla is for technical implementation (which is straightforward); this space is for discussion. They will yell at us if we spam them by discussing on bugzilla :) – Mike.lifeguard | @en.wb 16:16, 12 May 2008 (UTC)
If a site-owner has truly spammed us in spite of repeated warnings, then we are not libeling them if search engines pick up their listings here. I've dealt with such complaints before; I point out the clear evidence that's a matter of public record: warnings, diffs and our rules. I tell them if they find any factual inaccuracies in that record to let us know and we'll fix it immediately. I'm happy to discuss these blacklisting decisions with site-owners that bring them to the removals section.
Wikimedia's organization and servers are based in the United States; libel cases there are very difficult to pursue. A fundamental tenet there is that truth is an absolute defense. If our records are true and spammers have been previously warned and apprised of our rules, then they don't have a leg to stand on in that jurisdiction.
Servers may be based in the US, but they are accessible in the world. Thus you're liable in other courts, like say the UK (unless wikia's servers are in NY[10], whose law is questionable at best). In the UK just getting your case defended will cost you $200,000-plus up front, and much more if you lose - and if you are not represented, judgment will be entered against wikia in default. [11]
~ender 2008-05-18 11:04:AM MST
As for deterrence, periodically perusing hard core spammer forums like seoblackhat.com and syndk8.net as well as more general SEO forums like forums.digitalpoint.com will show lively discussions as to whether spamming us is worth the risks and aggravation. The more negative the chatter there about "wiki link-nazis", the better off our projects are.
My sense is that the volume of complaints has grown recently as we've blacklisted more domains based on bot reports. Some of these site-owners may not be truly innocent but they're not hard-core and haven't been warned sufficiently. Blacklisting comes as a large, alarming shock to them.
--A. B. (talk) 17:03, 12 May 2008 (UTC)
Of course it's not real libel. But that doesn't stop them from complaining, which is a waste of time. And a needless waste of time when it can be so easily stopped. On the other hand, I do see your point about being perceived as the link Nazis. Perhaps someone other than the three of us can share some thoughts?
Also, you're assuming there's a presumption of innocence. In other legal jurisdictions (and for a number of things in the US) that is not the case. You will have to prove that it is not libel.
~ender 2008-05-18 11:09:AM MST
I agree with excluding these pages from search engines. While Meta is nowhere near as high on searches as enwiki, it won't take much. Cary Bass demandez 23:18, 13 May 2008 (UTC)

I also agree, although I do believe that the fact that these reports rank so high works preventively. Finding these reports about other companies should stop other companies from doing the same. But given the problems with the negative impact these reports may have on a company (although that is also not our responsibility; when editing Wikipedia they get warned often enough that we are not a vehicle for advertising!), I think it is better to hide them, especially since our bots are not perfect and sometimes pick up links wrongly, and we are only human and may make mistakes in reporting here as well. --Dirk Beetstra T C (en: U, T) 09:45, 14 May 2008 (UTC)

Dirk, I think the way to "have our cake and eat it" is to have this talk page and its archives crawlable and just exclude the bot pages. As for mistakes here, I don't see many true mistakes in human-submitted requests that actually get blacklisted. I at least skim at almost all the blacklist removal and whitelist requests both here and on en.wikipedia. The most common mistake humans make is to unknowingly exclude all of a large hosting service (such as narod.ru) instead of just the offending subdomain; I have yet to see a large hosting service complain. Otherwise, >>90% of our human-submitted requests nowadays have been pretty well thrashed out on other projects first before they even get here. Of those that are still flawed, they either get fixed or rejected with a public discussion. It's not that humans are so much smarter than bots, but rather, like so many other things on wiki, after multiple edits and human editors, the final human-produced product is very reliable.
I spend several hours a month reading posts on several closed "black hat" SEO forums I'm a "member" of. The reliability, finality and public credibility of our spam blacklisting process bothers a lot of black hats. I think our goal should be to keep this going. --A. B. (talk) 12:30, 14 May 2008 (UTC)
The point isn't whether we're right or wrong. The point is whether they complain and waste our time or not. It seems to me that looking like link Nazis publicly is not a very strong rationale if it conflicts with the goal of doing the work without wasted time. That said, views on this may differ. – Mike.lifeguard | @en.wb 16:01, 14 May 2008 (UTC)
Would it be too radical if we ask for a way to choose which pages shouldn't be indexed, onwiki? An extension can be easily created and installed for Meta which lets us add a <noindex /> tag to the top of the article, and cause an appropriate "noindex" meta tag to be added to the HTML output, and thus prevent that page (and only that page) from being indexed. How do you feel about my raw idea? Is it too prone to being abused? Huji 13:38, 15 May 2008 (UTC)
Sounds good, but that would, if enabled on en, give strange vandalism I am afraid. I would be more inclined towards an extension which listens to a MediaWiki page where pages can be listed which should not be indexed (that should include their subpages as well). Don't know how difficult that would be to create. --Dirk Beetstra T C (en: U, T) 16:25, 15 May 2008 (UTC)
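(For reference, the robots.txt approach proposed at the top of this thread would amount to entries along these lines; the paths are assumptions and would have to match Meta's actual URL layout and the pages we decide to exclude.)

	# hypothetical robots.txt additions for Meta
	User-agent: *
	Disallow: /wiki/Talk:Spam_blacklist
	Disallow: /wiki/User:COIBot/XWiki/
	Disallow: /wiki/User:COIBot/LinkReports/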

Looking ahead

"Not dealing with a crisis that can be foreseen is bad management"

The Spam blacklist is now hitting 120K & rising quite fast. The log page started playing up at about 150K. What are our options looking ahead I wonder. Obviously someone with dev knowledge connections would be good to hear from. Thanks --Herby talk thyme 10:46, 20 April 2008 (UTC)

I believe that the extension is capable of taking a blacklist from any page (that is, the location is configurable, and multiple locations are possible). We could perhaps split the blacklist itself into several smaller lists. I'm not sure there's any similarly easy suggestion for the log though. If we split it up into a log for each of several blacklist pages, we wouldn't have a single, central place to look for that information. I suppose a search tool could be written to find the log entries for a particular entry. – Mike.lifeguard | @en.wb 12:24, 20 April 2008 (UTC)
What exactly are the problems with having a large blacklist? --Erwin(85) 12:34, 20 April 2008 (UTC)
Just the sheer size of it at a certain moment; it takes a long time to load, to search, etc. The above suggestion may make sense: smaller blacklists per month, transcluded into the top level? --Dirk Beetstra T C (en: U, T) 13:16, 20 April 2008 (UTC)
Not a technical person but the log page became very difficult to use at 150K. Equally the page is getting slower to load. As I say - not a techy - but my ideal would probably be "current BL" (6 months say) & before that? --Herby talk thyme 13:37, 20 April 2008 (UTC)
I don't know how smart attempting to transclude them is... The spam blacklist is technically "experimental" (which sounds more scary than it really is) so it may not work properly. I meant we can have several pages, all of which are spam blacklists. You can have as many as you want, and they can technically be any page on the wiki (actually, anywhere on the web that is accessible) provided the page follows the correct format. So we can have one for each year, and just request that it be added to the configuration file every year, which will make the sysadmins ecstatic, I'm sure :P OTOH, if someone gives us the go-ahead for transclusion, then that'd be ok too. – Mike.lifeguard | @en.wb 22:12, 20 April 2008 (UTC)
A much better idea: bugzilla:13805 bugzilla:4459 ! – Mike.lifeguard | @en.wb 01:43, 21 April 2008 (UTC)
Just to note that my browser will no longer render the spam blacklist properly (though it's all there in edit view) - this is a real problem! – Mike.lifeguard | @en.wb 18:58, 14 May 2008 (UTC)
Still there for me but "I told you so" :) By the time I'm back you will have got it all sorted out...... --Herby talk thyme 19:06, 14 May 2008 (UTC)
My suggestion until then is to add another Spam blacklist 2 (needs configuration change). When logging, make sure you say which one you're adding to. We should possibly also split the current one in half so it will be easier to search and load. Configuration would look something like:
	$wgSpamBlacklistFiles = array(
		"DB: metawiki Spam_blacklist", //the current one
		"DB: metawiki Spam_blacklist_2", //the new one
		"DB: metawiki Spam_blacklist_3" //we can even have them configure an extra so that when #2 gets full-ish, we can just start on #3
	);
If we want to do this, it is an easy configuration change - took me <30s on my wiki. Using multiple blacklists is a pretty good solution until we can get a special page to manage the blacklist. The only downside I can see is you'd have to search more than one page to find a blacklist entry (but at least the page will load and render properly!) – Mike.lifeguard | @en.wb 16:54, 15 May 2008 (UTC)

I'm thinking of writing a new extension which works based on a real interface, and allows much better management. Werdna 08:15, 16 May 2008 (UTC)

Until then, I would suggest going with Mike.lifeguard's suggestion. What about splitting off the old part? Or splitting the unlogged part into an 'old' spam blacklist, and for the active cases working with the normal, current blacklist. In that case there is no confusion about where to add to, and things render properly. The old one is only edited when deleting an entry. --Dirk Beetstra T C (en: U, T) 08:51, 16 May 2008 (UTC)
What is the 'old' part? I think it's a good idea having one active list. In any case there should be some logic in splitting the lists, so choosing what list to add to won't be arbitrary. --Erwin(85) 11:17, 16 May 2008 (UTC)
It is a good idea, but only if it works. Right now it doesn't work for me, so I see this as a problem that needs fixing sooner rather than later. I think Beetstra's method of splitting it up would be fine. – Mike.lifeguard | @en.wb 17:20, 16 May 2008 (UTC)

Requested as bugzilla:14322 because this is just getting silly.  – Mike.lifeguard | @en.wb 22:30, 28 May 2008 (UTC)

  • Jimbo has requested a rename to something which does not carry the spam and blacklist connotations - site owners get very agitated about being called spammers, and there's no need to rub their noses in it. I suggest that if we do split the list, we could split by function - redirection sites including Universe Daily would be a decent size on its own, I think. 80.176.82.42 21:55, 15 June 2008 (UTC)

LinkWatchers

Database conversions

I am working on both loading the old database (about 4 months' worth of links) and rebuilding the current database (about 5-6 weeks' worth of links) into a new database.

  • The old database is in an old format, and has to be completely reparsed..
  • The new database had a few 'errors' in it, and I am adding two new fields.

I am running through the old databases by username, starting with aaa.

As a result the new database does not contain too much data yet, and will be 'biased' towards usernames early in the alphabet.

This process may take quite some time, maybe weeks, as I have to throttle the conversion to keep the current linkwatchers 'happy' (they are still running in real-time). These linkwatchers are also putting their data into this new database, so everything after about 18:00, April 29, 2008 (UTC) is correct and complete.

The new database contains the following data, I will work later on making that more accessible for on-wiki research:

  1. timestamp - time when stored
  2. edit_id - service field
  3. lang - lang of wiki
  4. pagename - pagename
  5. namespace - namespace
  6. diff - link to diff
  7. revid - the revid, if known
  8. oldid - the oldid, if any
  9. wikidomain - the wikidomain
  10. user - the username
  11. fullurl - the full url that was added
  12. domain - domain, indexed and stripped of 'www.' -> www.example.com becomes com.example.
  13. indexedlink - rewrite of the fullurl, www.example.com/here becomes com.example./here
  14. resolved - the IP for the domain (new field, and if found)
  15. is it an ip - is the edit performed by an IP (new field)

I'll keep you posted. --Dirk Beetstra T C (en: U, T) 10:31, 30 April 2008 (UTC)

Well... to keep people posted:

We had two different tables, linkwatcher_linklog and linkwatcher_log. The former is in a reasonably new format, the latter in a very old, outdated format.

The table linkwatcher_linklog is being transferred into linkwatcher_newlinklog, and when a record is converted, it is moved to a backup table (linkwatcher_linklogbackup). That conversion is at about 31% now.

The table linkwatcher_log is completely reparsed, and when a record is converted, the record is transferred into linkwatcher_logbackup. Also for this table the conversion is about 29%.

All converted data goes into linkwatcher_newlinklog, as does the data that is currently being recorded by the linkwatcher 'bots'.

  • linkwatcher_linklog - 1,459,158 records - 946.7 MB
  • linkwatcher_linklogbackup - 646,127 records - 323.5 MB
  • linkwatcher_log - 1,526,600 records - 1.0 GB
  • linkwatcher_logbackup - 628,575 records - 322.8 MB
  • linkwatcher_newlinklog - 2,152,052 records - 1.1 GB

Still quite some time to go. The conversion of linkwatcher_linklog is at usernames starting with 'fir', linkwatcher_log is at usernames starting with 'jbo'. --Dirk Beetstra T C (en: U, T) 20:54, 9 May 2008 (UTC)

Update:

  • linkwatcher_linklog - 761,429 records - 946.7 MB
  • linkwatcher_linklogbackup - 1,036,146 records - 525.7 MB (58% converted)
  • linkwatcher_log - 1,159,216 records - 1.0 GB
  • linkwatcher_logbackup - 995,959 records - 501.8 MB (46% converted)
  • linkwatcher_newlinklog - 3,448,605 records - 1.8 GB

linklog is at "OS2", log is at "Par". --Dirk Beetstra T C (en: U, T) 15:01, 16 May 2008 (UTC)

Update (the first one is starting to convert IPs):

  • linkwatcher_linklog - 303,562 records - 946.7 MB
  • linkwatcher_linklogbackup - 1,494,013 records - 754.8 MB (83% converted)
  • linkwatcher_log - 732,773 records - 1.0 GB
  • linkwatcher_logbackup - 1,422,402 records - 711.0 MB (66% converted)
  • linkwatcher_newlinklog - 5,062,298 - 2.6 GB

linklog is at '199', log is at 'xli' (had to take one down for some time, too much work for the box). --Dirk Beetstra T C (en: U, T) 19:19, 25 May 2008 (UTC)

Whee! One database has been converted (the bot quit just minutes ago):

  • linkwatcher_linklog - 0 records - 946.7 MB
  • linkwatcher_linklogbackup - 1,797,575 records - 912.5 MB
  • linkwatcher_log - 447,802 records - 1.0 GB
  • linkwatcher_logbackup - 1,707,373 records - 836.6 MB
  • linkwatcher_newlinklog - 6,038,610 - 3.1 GB

The other bot is now at 79%, at '59.' (somewhere in the IP-usernames). Getting there! --Dirk Beetstra T C (en: U, T) 11:49, 30 May 2008 (UTC)

The bots have finished this job. The table 'linkwatcher_newlinklog' contains at the time of this post 7,012,340 records. The database is now 'complete' from approx. 2007-09-01 on (except for some bot downtime, which may be up to several days in total). Bots are now working on getting the time into UTC (see below) and I am working on a bot that fills up the gaps, and that can also parse the periods before the linkwatchers started, or can parse wikis that we excluded from the linkwatchers. --Dirk Beetstra T C (en: U, T) 10:51, 7 June 2008 (UTC)

Timestamp

Timestamp is from now on stored as the UTC time of the edit. I will start updating the old records later. --Dirk Beetstra T C (en: U, T) 19:22, 3 June 2008 (UTC)

Thanks for that. Did you try linking to the domain yet? --Erwin(85) 20:01, 3 June 2008 (UTC)
No. I see on en that sometimes people blank the report because they disagree; COIBot would then not be able to save an update, since the blacklisted link is in there. It should work normally, but with that problem it would go wrong.
For those records where the time has been converted to UTC, the bot adds ' (UTC)' to the timestamp. Take all other timestamps with a grain of salt, there are quite some which are wrong. A bot is working on it to update them, but due to insanity of the programmer of the bots, I have had to restart that program a couple of times (I suspect about 200.000 records are 'wrong' out of the 6.5 million in the database). --Dirk Beetstra T C (en: U, T) 10:31, 5 June 2008 (UTC)

Stats: old time: 6,320,971; new time: 746,689. --Dirk Beetstra T C (en: U, T) 11:26, 11 June 2008 (UTC)

More statistics ... ????

I now have a fairly complete database with a lot of links (we are hitting 7 million records in a couple of hours ..), and I could get a lot of statistics from that. The current table has the following fields:

  1. timestamp - UTC Time (subject to updating)
  2. edit_id - service field
  3. lang - lang of wiki
  4. pagename - pagename
  5. namespace - namespace
  6. diff - link to diff
  7. revid - the revid, if known
  8. oldid - the oldid, if any
  9. wikidomain - the wikidomain
  10. user - the username
  11. fullurl - the full url that was added
  12. domain - domain, indexed and stripped of 'www.' and 'www#.' -> 'www.example.com' becomes 'com.example.'
  13. indexedlink - rewrite of the fullurl, www.example.com/here becomes com.example./here
  14. resolved - the IP for the domain (new field, and if found)
  15. is it an ip - is the edit performed by an IP (new field)

NOTES:

  • 'resolved' is the IP of the server the website is hosted on. Put simply: a webserver is a computer, and that computer, when connected to the internet, has an IP. When you request a webpage, a nameserver converts the domain name to the IP of the computer the site is hosted on, and the webpage is then requested from the computer with that IP. In other words, if you run a webserver you can register a large number of domains and host them all on that one computer. A spammer can therefore use a large number of domains (which might prevent detection), but all these websites will still have the same 'resolved' IP!
  • we convert the normal domain to the indexed form to make it faster to search (MySQL reasons).
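
To make those two notes concrete, here is a minimal sketch (in Perl; not the actual linkwatcher code, and the helper names are made up for illustration) of how the 'domain', 'indexedlink' and 'resolved' fields could be derived from an added URL:

#!/usr/bin/perl
# Sketch only: derive 'domain', 'indexedlink' and 'resolved' from a URL.
use strict;
use warnings;
use Socket qw(inet_aton inet_ntoa);

# 'www.example.com' -> 'com.example.' : strip 'www.'/'www#.' and reverse
# the labels, so a MySQL prefix search like "com.example.%" also finds
# every subdomain of example.com.
sub index_domain {
    my ($host) = @_;
    $host = lc $host;
    $host =~ s/^www\d*\.//;
    return join('.', reverse(split(/\./, $host))) . '.';
}

# 'resolved': the IP the name currently points to, if it resolves at all.
# Domains hosted on the same server share this value.
sub resolve_domain {
    my ($host) = @_;
    my $packed = inet_aton($host);
    return defined $packed ? inet_ntoa($packed) : '';
}

my $fullurl = 'http://www.example.com/here';
my ($host, $path) = $fullurl =~ m{^https?://([^/?#]+)([^#]*)};
$path = '' unless defined $path;

my $domain      = index_domain($host);      # 'com.example.'
my $indexedlink = $domain . $path;          # 'com.example./here'
my $resolved    = resolve_domain($host);    # e.g. '192.0.2.1'

print "$domain\n$indexedlink\n$resolved\n";

This is only meant to show why two spam domains hosted on the same server end up with the same 'resolved' value; the bots may of course compute and store these fields differently.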

At the moment the bots calculate:

  1. how many links did this user add
  2. how often did this domain get added
  3. how often did this user add this domain
  4. to how many wikis did this user add this domain
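
As a rough illustration of these four counters, a sketch with plain SQL counts (the database name, credentials and the exact meaning of "wiki" are assumptions based on the field list above, not the bots' real code):

#!/usr/bin/perl
# Sketch: the four per-addition counters as straightforward SQL counts.
use strict;
use warnings;
use DBI;

# Connection details are placeholders.
my $dbh = DBI->connect('DBI:mysql:database=linkwatcher;host=localhost',
                       'lwuser', 'secret', { RaiseError => 1 });

my ($user, $domain) = ('Example user', 'com.example.');

# 1. how many links did this user add
my ($usercount) = $dbh->selectrow_array(
    'SELECT COUNT(*) FROM linkwatcher_newlinklog WHERE user = ?',
    undef, $user);

# 2. how often did this domain get added
my ($linkcount) = $dbh->selectrow_array(
    'SELECT COUNT(*) FROM linkwatcher_newlinklog WHERE domain = ?',
    undef, $domain);

# 3. how often did this user add this domain
my ($userlinkcount) = $dbh->selectrow_array(
    'SELECT COUNT(*) FROM linkwatcher_newlinklog WHERE user = ? AND domain = ?',
    undef, $user, $domain);

# 4. to how many wikis did this user add this domain
#    (assuming a "wiki" is the combination of the lang and wikidomain fields)
my ($userlinklangcount) = $dbh->selectrow_array(
    'SELECT COUNT(DISTINCT lang, wikidomain) FROM linkwatcher_newlinklog
      WHERE user = ? AND domain = ?',
    undef, $user, $domain);

print "$usercount $linkcount $userlinkcount $userlinklangcount\n";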

But I also have the possibility to calculate:

  1. how many users that added this link were not using a user account.
  2. on how many computers are the websites that this user adds hosted.
  3. how often did domains that are hosted on 'this' computer (for a certain domain) get added.
  4. how often did this user add domains hosted on 'this' computer (for a certain domain).
  5. to how many wikis did this user add domains hosted on 'this' computer.

etc. etc.
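
The 'per server' variants would look much the same, only counting on the 'resolved' field instead of the domain. Again a sketch only, with the same placeholder connection details and table layout as above:

#!/usr/bin/perl
# Sketch: counting by hosting server ('resolved') instead of by domain.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('DBI:mysql:database=linkwatcher;host=localhost',
                       'lwuser', 'secret', { RaiseError => 1 });

my ($user, $domain) = ('Example user', 'com.example.');

# Look up the server IP recorded for this domain.
my ($ip) = $dbh->selectrow_array(
    'SELECT resolved FROM linkwatcher_newlinklog WHERE domain = ? LIMIT 1',
    undef, $domain);

# how often did domains hosted on this server get added
my ($servercount) = $dbh->selectrow_array(
    'SELECT COUNT(*) FROM linkwatcher_newlinklog WHERE resolved = ?',
    undef, $ip);

# how often did this user add domains hosted on this server,
# and to how many wikis (languages) did this user add them
my ($userservercount, $userserverlangs) = $dbh->selectrow_array(
    'SELECT COUNT(*), COUNT(DISTINCT lang) FROM linkwatcher_newlinklog
      WHERE user = ? AND resolved = ?',
    undef, $user, $ip);

print "$servercount $userservercount $userserverlangs\n";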

The biggest problem is how to organise that information and how to make it available to you here. But if there are statistics that would greatly improve (y)our efforts here, please let me know. --Dirk Beetstra T C (en: U, T) 10:38, 6 June 2008 (UTC)

Looking back ...

I have just started to try and 'parse backward', in order to fill some gaps in the database (well, actually I am trying to parse 'everything'). The process is still being optimised, will probably be VERY slow, and I am not yet sure how to run it optimally. I am parsing upwards over the MediaWiki 'page_id', starting at 1 (i.e. an API query on 'pageids'). This should favour the small wikis, which only have a couple of thousand page IDs. I'll keep you posted.
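
Purely to illustrate the approach (this is not the actual crawler: en.wikipedia, the small pageid range, the fixed rvlimit and the naive link regex are all just assumptions for the sketch):

#!/usr/bin/perl
# Sketch: walk page_ids upwards and pull the revisions of each page,
# so link additions can be reconstructed from the revision texts.
use strict;
use warnings;
use LWP::Simple qw(get);
use JSON qw(decode_json);

my $api = 'http://en.wikipedia.org/w/api.php';

for my $pageid (1 .. 10) {    # in reality: continue up to the wiki's highest page_id
    my $url = "$api?action=query&prop=revisions&pageids=$pageid"
            . '&rvlimit=50&rvdir=newer&rvprop=timestamp|user|content&format=json';
    my $json = get($url) or next;                # skip on fetch errors
    my $data = decode_json($json);
    my $page = $data->{query}{pages}{$pageid};
    next unless $page && $page->{revisions};     # missing or deleted page_id

    # A real crawler would follow the query-continue parameter to get
    # all revisions; 50 is enough for a sketch.
    my %seen;    # links already present in an earlier revision
    for my $rev (@{ $page->{revisions} }) {
        my $text = $rev->{'*'} || '';
        for my $link ($text =~ m{(https?://[^\s\]<>"]+)}g) {
            next if $seen{$link}++;
            # first revision in which this link appears: treat it as an
            # addition by $rev->{user} at $rev->{timestamp}
            print "$rev->{timestamp}\t$rev->{user}\t$link\n";
        }
    }
}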

The 'backcrawled' data can be recognised by '(utc)' (instead of '(UTC)') behind the timestamp. Please poke me if something is wrong with some of the data, and I will do something about it.

Backcrawled: 6043 records. --Dirk Beetstra T C (en: U, T) 11:26, 11 June 2008 (UTC)

Thresholding the xwiki

The linkwatchers now calculate the following 4 values for each link addition (all based on what is currently in the database):

  1. UserCount - how many external links did this user add
  2. LinkCount - how often is this external link added
  3. UserLinkCount - how often did this user add this link
  4. UserLinkLangCount - to how many wikis did this user add this link.

The threshold was first:

if ((($userlinklangcount / $linkcount) > 0.90)  && ($linkcount > 2) && ($userlinklangcount > 2)) {
   report
}

I noticed that when one user performs two edits for the first link addition on one wiki, and then starts adding the link to other wikis as well, the user only gets reported at edit 11, which I found way too late:

  • 3/3
  • 10/11
  • 11/12

earlier or in-between combinations do not pass that threshold.

The code is now:

if ((($userlinkcount / $linkcount) > 0.66)  && (($userlinklangcount / $linkcount) > 0.66 ) && ($userlinklangcount > 2)) {
  report
}

This now reports at (userlink/link & wikis/link):

  • 3/3 & 3/3
  • 3/4 & 3/4
  • 4/5 & 4/5
  • 5/6 & 5/6
  • 5/7 & 5/7
  • 6/8 & 6/8

etc.

I am thinking of also adding something like ($userlinkcount < xxx), which should exclude some more established editors, xxx being .. 100? (We had a case of one editor adding 20 links in one edit .. you would need to be pretty hardcore to escape a cap of 100, adding 34 links with every edit.)
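
For illustration, a minimal sketch of the current condition with that proposed extra check bolted on (the variable names follow the pseudocode above; the cap of 100 is only the suggestion from this paragraph, not something the bots do yet):

# Sketch only, not the linkwatchers' actual code.
sub should_report {
    my ($userlinkcount, $linkcount, $userlinklangcount) = @_;
    return 0 if $linkcount == 0;            # nothing recorded yet
    return 0 if $userlinkcount >= 100;      # proposed cap: skip established editors
    return ((($userlinkcount / $linkcount) > 0.66)
         && (($userlinklangcount / $linkcount) > 0.66)
         && ($userlinklangcount > 2)) ? 1 : 0;
}

With this, 3/3 & 3/3 would still report, while an editor who has already added well over a hundred links would be skipped.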

I want to say here that the threshold is low, and maybe it should be. Cleaning 10 wikis when crap is added is quite some work; it is easier to close/ignore a report where only 4 wikis were affected. I will let this run and see what happens. If it gives a lot more work, I am happy to set the threshold higher.

Comments? --Dirk Beetstra T C (en: U, T) 15:15, 22 May 2008 (UTC)

I feel like this is too low. Most reports are simply closed with "reverted" as they're not enough to warrant blacklisting. As such, I'm not sure how useful that is. Not sure how the math should change, but there must be a happy medium.  – Mike.lifeguard | @en.wb 14:23, 3 June 2008 (UTC)
If they need reverting then I think it is OK. If I make it (e.g.) 5, then the questionable 3-edit cases would not get reverted. --Dirk Beetstra T C (en: U, T) 14:28, 3 June 2008 (UTC)

Useful investigation tool: url-info.appspot.com

In investigating User:COIBot/XWiki/url-info.appspot.com, I found that the linked site provides a small browser add-on that's a very useful tool for investigating all the links embedded in a page, whether it's a Wikimedia page or an external site. I recommend others active in spam mitigation add it to their browser toolbar and check it out. It could have saved me many hours in the last year.

If you're trying to find possibly related domains linked on a spam site, this will quickly list them all, sparing the aggravation of clicking on every link. In fact, it's so easy to glean information that we'll need to ensure we're not mindlessly reporting unrelated domains as "related" when they've appeared on a spam site page for some innocent reason.

As for the spam report for this domain, I don't think the extent of COI linking (just 1 link to each of 4 projects) currently meets the threshold for meta action; local projects can deal with this as they see fit. The tool is free and the page has no ads.

Note that appspot.com, the underlying main domain, is registered to Google for users of its App Engine development environment. --A. B. (talk) 14:22, 21 May 2008 (UTC)

It could be useful. It simply lists external links though, so like you said, I guess most of them have nothing to do with the site. Note that the add-on is a w:en:bookmarklet, so not an extension or anything. --Erwin(85) 16:06, 21 May 2008 (UTC)

Of possible interest to people here

I've detected a rise in new approaches to promotional activity involving Commons. I've posted there & others may wish to read/look. Thanks --Herby talk thyme 11:10, 31 May 2008 (UTC)

Log entry in bot reports

How do you feel about using javascript to replace ADMINNAME in a bot report's log entry with your own user name? The bots should use a span with a certain id like:

 \bexample\.com\b # <span id="adminname">ADMINNAME</span> # see [[User:COIBot/XWiki/example.com]]

If we then add the following code to MediaWiki:Common.js, ADMINNAME will be replaced for sysops.

function setAdminName()
{
    // Anonymous users have no groups; nothing to do for them.
    if (!wgUserGroups) return;
    for (var key in wgUserGroups)
    {
        if (wgUserGroups[key] == 'sysop')
        {
            // Only replace the placeholder if it is present on this page.
            var span = document.getElementById('adminname');
            if (span)
            {
                span.innerHTML = wgUserName;
            }
            break;
        }
    }
}

addOnloadHook(setAdminName);

We could of course set it equal to everyone's user name without checking if the person is a sysop, but I think that would be confusing. It doesn't save a whole lot of time, but I'm just lazy. --Erwin(85) 13:26, 22 June 2008 (UTC)

I am in favour, I'll prepare the bot for it. --Dirk Beetstra T C (en: U, T) 10:19, 24 June 2008 (UTC)
I have tweaked the line a bit more, so that the first # is aligned at horizontal position 40. I think that that is about right, and gives a nicer formatted log, except for very long regexes (but if you want it on another position, please poke me). --Dirk Beetstra T C (en: U, T) 11:34, 24 June 2008 (UTC)
I've added it to Common.js. --Erwin(85) 19:55, 25 June 2008 (UTC)

Random (junk?) thought

It went here! (it's a wiki - if anyone disagrees...) --Herby talk thyme 09:12, 9 June 2008 (UTC)