User talk:Citation bot: Difference between revisions

From Wikipedia, the free encyclopedia
ClueBot III (talk | contribs)
m Archiving 1 discussion to User talk:Citation bot/Archive_18. (BOT)

Revision as of 17:43, 10 September 2019

You may want to increment {{Archive basics}} to |counter= 19 as User talk:Citation bot/Archive 18 is larger than the recommended 150Kb.

Note that the bot's maintainer and assistants (Thing 1 and Thing 2) can go weeks without logging in to Wikipedia. The code is open source, and interested parties are invited to assist with the operation and extension of the bot.

Before reporting a bug, please note: Addition of DUPLICATE_xxx= to citation templates by this bot is a feature. When there are two identical parameters in a citation template, the bot renames one to DUPLICATE_xxx=. The bot is pointing out the problem with the template. The solution is to choose one of the two parameters and remove the other one, or to convert it to an appropriate parameter.
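The renaming described above can be sketched in a few lines of Python. This is only an illustration, not the bot's actual code: the function name `mark_duplicate_params` is hypothetical, and the choice to keep the first occurrence and rename the later ones is an assumption, since the note does not say which of the two parameters gets renamed.

```python
def mark_duplicate_params(params):
    """Rename repeated parameter names to DUPLICATE_<name>.

    `params` is a list of (name, value) pairs in template order.
    Assumption: the first occurrence keeps its name; later ones are renamed.
    """
    seen = set()
    result = []
    for name, value in params:
        if name in seen:
            # A parameter with this name already appeared: flag this one.
            result.append(("DUPLICATE_" + name, value))
        else:
            seen.add(name)
            result.append((name, value))
    return result


# Example with made-up values: the second |title= is flagged for human review.
example = mark_duplicate_params([("title", "A"), ("year", "2019"), ("title", "B")])
```

An editor would then pick one of `title` / `DUPLICATE_title` and delete or convert the other.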

Or, for a faster response from the maintainers, submit a pull request with an appropriate code fix on GitHub, if you can write the needed code.

Request: Capitalize linked journals

Status
Feature Request
Reported by
Headbomb {t · c · p · b} 17:09, 30 April 2019 (UTC)[reply]
What should happen
[1]
We can't proceed until
Feedback from maintainers


That is very dangerous territory. We would have to verify that the old page did not exist at all and that the new page did exist. We really have not ever got in the business of fixing red links. AManWithNoPlan (talk) 15:01, 2 May 2019 (UTC)[reply]

It's not a matter of fixing redlinks, it's a matter of capitalization. E.g. Journal of physics vs Journal of Physics or INTERNATIONAL JOURNAL OF SYSTEMATIC AND EVOLUTIONARY MICROBIOLOGY vs International Journal of Systematic and Evolutionary Microbiology. Or Developmental neuroscience vs Developmental Neuroscience. Headbomb {t · c · p · b} 15:43, 2 May 2019 (UTC)[reply]
And in the rare case that the capitalized version links to a different page, it will link to the correct page instead of the wrong one. Headbomb {t · c · p · b} 15:48, 2 May 2019 (UTC)[reply]
Unless it’s a foreign-language title — the bot sometimes gets a little overzealous capitalizing words, and a redirect from a title with extra capitalization might not exist yet for articles about some publications. Umimmak (talk) 14:23, 4 September 2019 (UTC)[reply]
That's mostly taken care of through this + a custom list of foreign titles. This is just bringing the bot in line with what it would do to an unlinked title, so if there's an issue with capitalization, it wouldn't be specific to the linked version. Headbomb {t · c · p · b} 14:33, 4 September 2019 (UTC)[reply]
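A rough Python sketch of the request, applying the same title-casing to an unpiped wikilink that would be applied to an unlinked title. Everything here is an assumption for illustration: the small-word list, the helper names, and the decision to leave piped links alone are not taken from the bot's code, and the foreign-title exceptions mentioned above are not modelled.

```python
import re

# Hypothetical list of words kept lowercase unless they start the title.
SMALL_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the"}

# Matches [[Target]] but not piped links such as [[Target|label]].
UNPIPED_LINK = re.compile(r"\[\[([^\[\]|]+)\]\]")


def title_case(title):
    """Capitalize each word except small words that are not in first position."""
    out = []
    for i, word in enumerate(title.split()):
        lower = word.lower()
        out.append(lower if i > 0 and lower in SMALL_WORDS else lower.capitalize())
    return " ".join(out)


def capitalize_linked_journal(value):
    """Title-case the target of every unpiped wikilink in a |journal= value."""
    return UNPIPED_LINK.sub(lambda m: "[[" + title_case(m.group(1)) + "]]", value)
```

This naive version would also flatten acronyms (an all-caps abbreviation becomes a capitalized word), which is one reason a real implementation needs exception lists.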

Strip semicolons

Status
feature request
Reported by
Headbomb {t · c · p · b} 00:13, 18 June 2019 (UTC)[reply]
What should happen
[2], [3]
We can't proceed until
Feedback from maintainers


This should perhaps not apply to |title= however. Also might not be safe to do in some identifiers. Headbomb {t · c · p · b} 00:14, 18 June 2019 (UTC)[reply]

And, as always, title's good friend |chapter= too. AManWithNoPlan (talk) 13:57, 23 June 2019 (UTC)[reply]
And contribution and other aliases. Headbomb {t · c · p · b} 15:49, 23 June 2019 (UTC)[reply]
And NOT & a m p ; and his friends. AManWithNoPlan (talk) 01:25, 30 June 2019 (UTC)[reply]
Rather than a blacklist, we would want a white list of parameters. AManWithNoPlan (talk) 15:17, 6 July 2019 (UTC)[reply]
Get the list of all parameters and remove those then. Headbomb {t · c · p · b} 20:07, 6 July 2019 (UTC)[reply]
https://github.com/ms609/citation-bot/pull/2130 AManWithNoPlan (talk) 19:06, 31 August 2019 (UTC)[reply]
Should also handle author/editors/contributors/others (and their variants) Headbomb {t · c · p · b} 19:55, 31 August 2019 (UTC)[reply]
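The whitelist idea from this thread could look roughly like the Python sketch below. It assumes the request is about trailing semicolons in parameter values; the parameter set `SAFE_PARAMS` and the function name are hypothetical and are not the contents of the linked pull request. Values ending in an HTML entity such as `&amp;` keep their semicolon, and `title`/`chapter` are deliberately left off the list.

```python
import re

# Hypothetical whitelist: parameters where a trailing semicolon is assumed safe to drop.
SAFE_PARAMS = {"journal", "publisher", "volume", "issue", "pages"}

# A value that ends in an HTML entity (e.g. "&amp;") must keep its semicolon.
ENTITY_AT_END = re.compile(r"&(#\d+|#x[0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);$")


def strip_trailing_semicolon(name, value):
    """Remove one trailing semicolon from whitelisted parameters only."""
    if name not in SAFE_PARAMS:
        return value
    stripped = value.rstrip()
    if stripped.endswith(";") and not ENTITY_AT_END.search(stripped):
        return stripped[:-1].rstrip()
    return value
```

A whitelist fails safe: a parameter nobody thought about is simply left untouched.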

Idea: Usage stats

Not really anything pressing, but now that we have OAuth in, it would be neat to have usage statistics. Who makes use of the bot. If the bot is activated via the web interface, scripts, etc... Or whatever else is trackable. Headbomb {t · c · p · b} 06:15, 27 June 2019 (UTC)[reply]

I guess one could sort the bot contributions based on if the edit summary said “category” and one could query Wikipedia and search for edit summaries with the “use this tool” text in them. AManWithNoPlan (talk) 13:40, 27 June 2019 (UTC)[reply]
Having a &via=... in the API would likely be a better way of tracking things, but right now I'm mostly thinking about something very non-critical. I'll take any bug fix and things that actually affect the edits of the bot over usage stats though. Just figured if one of the talk page stalkers felt like compiling stats, or building a sub-module that would export information into an external database after every edit, well that's a nice little project. Headbomb {t · c · p · b} 17:17, 27 June 2019 (UTC)[reply]
We currently have no logging, so any logging would have to be done in the edit summaries. AManWithNoPlan (talk) 14:32, 28 June 2019 (UTC)[reply]
"Currently", yup. But if there was logging, we could have graphs/stats like [4], except for citation bot usage, instead of pageviews.
Anyway, it's an idea more than anything. Not critical by far, and I'd rather have someone else work on that if that ever gets done (unless we suddenly run out of edit-related bug fixes and feature requests). Headbomb {t · c · p · b} 15:15, 28 June 2019 (UTC)[reply]
Could it be enough to add a hashtag and rely on toolforge:hashtags/? Nemo 15:37, 28 June 2019 (UTC)[reply]
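As a rough illustration of the edit-summary approach suggested earlier in this thread, the Python sketch below buckets summaries by the two text markers mentioned ("category" and "use this tool"). The bucket names and function names are made up for the example, and fetching the summaries themselves is left out.

```python
from collections import Counter


def classify_summary(summary):
    """Guess how an edit was triggered from its edit summary text."""
    text = summary.lower()
    if "category" in text:
        return "category run"
    if "use this tool" in text:
        return "tool link"
    return "other"


def usage_counts(summaries):
    """Count edits per trigger type for an iterable of edit summaries."""
    return Counter(classify_summary(s) for s in summaries)
```

This only recovers what the summaries happen to say, which is why a dedicated marker such as the hashtag or `&via=` ideas above would give cleaner numbers.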

Question about handles

Extended content

I'm building a list of various handle links, e.g.

What do you need to know to implement hdl conversion? Do you need to know all root paths

domains? Or just

or even just

? Headbomb {t · c · p · b} 07:24, 27 June 2019 (UTC)[reply]

Also does knowing http vs https matter? Headbomb {t · c · p · b} 07:25, 27 June 2019 (UTC)[reply]

http and https is irrelevant. Right now, each and every URL path is specific. I should change it to be hosts and paths separate. Hosts is probably enough, unless you find a new file path beyond the usual suspects. Please verify each host actually works though; http://oasis.postech.ac.kr/handle/2014.oak/9965 is not a handle 🙄. AManWithNoPlan (talk) 13:45, 27 June 2019 (UTC)[reply]

@AManWithNoPlan:, well, I'm building a massive list with the help of others (e.g. [5]), so I want to know what's the most useful format. Right now, if I have something like

I'll eliminate things that only differ after the /handle/ part, and have something like

and currently have 2169 such paths. Which I could reduce to (after checking that they indeed work inside a {{hdl}})

But I was wondering if there was a way to trim that down further to something more manageable/less redundant. Headbomb {t · c · p · b} 17:05, 27 June 2019 (UTC)[reply]

While it is true that some of them probably do not have all these possibilities, I doubt that we would run into a case where http://digilib.gmu.edu/dspace/handle/ works, but http://digilib.gmu.edu/bitstream/handle/ is not a handle but something else. So, what I need are three lists:

  1. Protocol: http and https (short list)
  2. Host names (HUGE list)
  3. Suffix list (/handle/, /bitstream/handle/, ....) (medium sized list).

The code can then accept and convert any combination. AManWithNoPlan (talk) 17:22, 27 June 2019 (UTC)[reply]

That works. Headbomb {t · c · p · b} 17:29, 27 June 2019 (UTC)[reply]

The easy stuff

Protocols: https?
Suffix: \/((dspace|dspace-law|jspui|repository|xmlui)\/)?(bitstream\/)?handle\/

Going to build the host names list. It's in the ballpark of 1228 domains. Headbomb {t · c · p · b} 17:55, 27 June 2019 (UTC)[reply]
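The three-list approach (protocols, host names, path suffixes) might be sketched in Python as follows. This is not the code from the bot: `HOSTS` holds a single host taken from the discussion as a stand-in for the list still being compiled, `PATH_PREFIXES` is an illustrative subset, and the function assumes a handle is exactly the first two path segments after the prefix (so a trailing file name in a bitstream URL is dropped).

```python
from urllib.parse import urlsplit

PROTOCOLS = {"http", "https"}
# Stand-in host list; the real list was still being built at this point.
HOSTS = {"digilib.gmu.edu"}
# Illustrative subset of path prefixes that precede the handle itself.
PATH_PREFIXES = [
    "/dspace/bitstream/handle/",
    "/dspace/handle/",
    "/bitstream/handle/",
    "/handle/",
]


def url_to_handle(url):
    """Return 'prefix/suffix' if the URL matches a known combination, else None."""
    parts = urlsplit(url)
    if parts.scheme not in PROTOCOLS or parts.hostname not in HOSTS:
        return None
    for prefix in PATH_PREFIXES:
        if parts.path.startswith(prefix):
            segments = parts.path[len(prefix):].split("/")
            # Assumption: a handle is the first two non-empty segments.
            if len(segments) >= 2 and segments[0] and segments[1]:
                return segments[0] + "/" + segments[1]
            return None
    return None
```

Because hosts are checked against an explicit set, a URL such as http://oasis.postech.ac.kr/handle/2014.oak/9965 is left alone unless its host has been verified and added.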

currently we use a single Regex. I will need to change that. I already have a plan for some simple fast code. AManWithNoPlan (talk) 18:58, 27 June 2019 (UTC)[reply]
Code written, now for testing. AManWithNoPlan (talk) 21:19, 27 June 2019 (UTC)[reply]
https://github.com/ms609/citation-bot/pull/1856 AManWithNoPlan (talk) 21:20, 27 June 2019 (UTC)[reply]
More https://github.com/ms609/citation-bot/pull/1857 AManWithNoPlan (talk) 23:37, 27 June 2019 (UTC)[reply]
when you have a host list post the link. AManWithNoPlan (talk) 03:49, 28 June 2019 (UTC)[reply]
A preview is in User:Headbomb/Sandbox. User:Betacommand will run a script to see which handle links resolve when put into a {{hdl}}. I'll then be able to give you a list of domains that could be converted. It likely won't cover everything, but it'll probably cover 95%+ of cases. Headbomb {t · c · p · b} 03:56, 28 June 2019 (UTC)[reply]
Headbomb Got a final list yet? AManWithNoPlan (talk) 15:31, 3 July 2019 (UTC)[reply]
Still chugging at it. The list of HDL urls that didn't work still needs manual review, because some of the servers were only temporarily down and the list was not in the most convenient of formats. Should have it by the end of the week though. Headbomb {t · c · p · b} 15:47, 3 July 2019 (UTC)[reply]
Headbomb Got a final list yet? AManWithNoPlan (talk) 14:27, 19 July 2019 (UTC)[reply]
Still working on it. Not forgotten though. I was travelling for a while, then had computer issues (dead PSU) which prevented me from working on it. Hoping to have it done this weekend. Headbomb {t · c · p · b} 16:08, 19 July 2019 (UTC)[reply]
Headbomb any progress AManWithNoPlan (talk) 17:28, 16 August 2019 (UTC)[reply]
It's still on the to-do list. Headbomb {t · c · p · b} 22:30, 24 August 2019 (UTC)[reply]
Status
new bug
Reported by
Headbomb {t · c · p · b} 12:45, 24 July 2019 (UTC)[reply]
What happens
[6]
What should happen
[7]
We can't proceed until
Feedback from maintainers


That's more annoying than it sounds since we have to check a lot of name parameters. AManWithNoPlan (talk) 18:25, 9 August 2019 (UTC)[reply]

Category/batch whitelist

Category/batch runs are being abused. Possibilities on dealing with this are:

  • a) a whitelist of people allowed to ask for unlimited category/batch runs
    • This could just be something like extended confirmed.
  • b) a whitelist for limited category/batch runs (say ~250 pages at once, tops)
    • This could just be something like autoconfirmed/confirmed.
  • c) a way to kill inappropriate category/batch runs

And have category/batch runs disabled/greatly limited (~25 articles) for non-confirmed/whitelisted users. Headbomb {t · c · p · b} 17:48, 7 August 2019 (UTC)[reply]
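The tiers proposed above could be expressed as a small lookup, sketched here in Python. The numbers come from the proposal (unlimited, ~250, ~25); the function names are hypothetical, and mapping the tiers to the `extendedconfirmed`, `autoconfirmed` and `confirmed` user groups follows the "could just be something like" suggestions rather than any agreed design.

```python
def batch_limit(groups, whitelisted=False):
    """Return the maximum pages per category/batch run, or None for unlimited."""
    if whitelisted or "extendedconfirmed" in groups:
        return None   # option (a): unlimited runs
    if "autoconfirmed" in groups or "confirmed" in groups:
        return 250    # option (b): limited runs
    return 25         # everyone else: greatly limited


def trim_batch(pages, groups, whitelisted=False):
    """Cut a requested page list down to what the requesting user is allowed."""
    limit = batch_limit(groups, whitelisted)
    return list(pages) if limit is None else list(pages)[:limit]
```

Option (c), killing a run already in progress, is a separate mechanism and is not covered by this sketch.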

A/B may also prevent sock puppets and "suspicious" new users that may intend to use the bot in ways that are undesired from doing so. Users without edits or very few edits might not check their edits or won't see possible mistakes by the bot and as such won't report them. Proposal B seems like a good one to go forward with in any case in my opinion. For proposal C it might be good to define who could use that option, only maintainers and the operator or also some "trusted" users; we would also need to define what is considered inappropriate. For option A it might also be an idea to let extended confirmed users run up to 1000 pages, and then have a further whitelist of users who can do unlimited runs, i.e. bureaucrats, administrators and "normal" users who have proven they understand what the bot does and the impact of extremely large runs (i.e. don't run during high usage times), and who possibly are also actively reporting bugs and joining in discussion here. Just a few things to think about. -Redalert2fan (talk) 20:05, 16 August 2019 (UTC)[reply]

Garbage archive-url cleanup

Status
new bug
Reported by
Headbomb {t · c · p · b} 15:26, 14 August 2019 (UTC)[reply]
What should happen
[8]
We can't proceed until
Feedback from maintainers


Better batch/queuing handling revisited

The bot has been effectively disabled for the last week or so due to Chris Capoccia's insanely large-category requests (e.g. Category:Pages with citations having bare URLs) that hog all the resources. Please implement better parallelism à la first "Extended content" box in the see also link above, or something similar enough that one or two large requests don't disable the bot for everyone else. Headbomb {t · c · p · b} 15:40, 1 September 2019 (UTC)[reply]

maybe some simple intermediate steps would be good. currently the bot is still churning on something from a couple days ago. it doesn't even appear in any of my browser windows and there's no way for me to stop it. maybe the bot could refuse to do large requests. or even eliminate the category box altogether.  — Chris Capoccia 💬 14:12, 2 September 2019 (UTC)[reply]
For intermediate steps, see #Category/batch whitelist. Headbomb {t · c · p · b} 19:55, 2 September 2019 (UTC)[reply]
it would be nice to have multiple bot/zotero accounts AManWithNoPlan (talk) 20:36, 2 September 2019 (UTC)[reply]

And again, because of the massive run against Category:CS1 errors: missing periodical, with over 300K articles in it. Please kill this! Headbomb {t · c · p · b} 14:54, 10 September 2019 (UTC)[reply]

I requested a bot block at WP:ANI for this. This is way too large an unsupervised bot run. Headbomb {t · c · p · b} 17:43, 10 September 2019 (UTC)[reply]

Citation bot completes two journal citations but leaves two journal citations from the same source uncompleted

Status
new bug
Reported by
Jo-Jo Eumerus (talk, contributions) 10:19, 10 September 2019 (UTC)[reply]
What happens
Before this edit, the reference list contained four incomplete "cite journal" templates with http://adsabs.harvard.edu/abs/ URLs. Of these four templates, two had journal, volume, author and pagenumber information filled in, the other two didn't. It's not clear why they didn't get completed as well.
Relevant diffs/links
[9] + [10]
We can't proceed until
Feedback from maintainers