Read Wikipedia piped links


I'm using Java and I want to read piped links from Wikipedia that have a specific surface form. For example, in the link [[America|US]] the surface form is "US" and the internal link is "America".

The straightforward solution is to read the XML dump of Wikipedia and find the strings that match the regular expression for a piped link. However, I am afraid that I wouldn't cover all the possible forms a piped link can take. I searched and couldn't find any library that specifically gives me the piped links.

Any suggestions?



Now that I understand the question: I don't think there is a way to get all internal links with their printout value. This is simply not stored in the <a href="https://www.mediawiki.org/wiki/Manual:Database_layout" rel="nofollow">database</a> (only <a href="https://www.mediawiki.org/wiki/Manual:Pagelinks_table" rel="nofollow">links</a> are), because the actual output is only created when the page is rendered.

You would have to <a href="https://www.mediawiki.org/wiki/API:Parsing_wikitext" rel="nofollow">parse the pages</a> yourself to be sure to get all links. Of course, if you can accept getting only the subset of links that appear literally in the wikitext of each page, parsing the XML dump as you suggest would work. Note that a single regex will most likely not distinguish between piped internal links and <a href="https://meta.wikimedia.org/wiki/Help:Piped_link" rel="nofollow">piped interwiki links</a>. Also beware of image links, which use pipes to separate parameters (e.g. [[Image:MyImage.jpeg|thumb|left|A caption!]]).
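If you go the dump route, a streaming parser keeps memory use manageable. Here is a minimal sketch using the JDK's StAX reader; the class name DumpReader and the method forEachPageText are made up for illustration, and it assumes the wikitext of each revision sits in a text element, as in the sample below.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class DumpReader {

    /** Streams the XML and hands the content of every text element to the callback. */
    static void forEachPageText(Reader xml, Consumer<String> onText) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            while (reader.hasNext()) {
                // getLocalName() ignores the namespace prefix, so a namespaced dump works too
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "text".equals(reader.getLocalName())) {
                    onText.accept(reader.getElementText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
    }

    public static void main(String[] args) {
        // Tiny stand-in for a real dump file (assumption: same element layout)
        String sample = "<mediawiki><page><title>Example</title><revision>"
                + "<text>See [[America|US]].</text></revision></page></mediawiki>";
        forEachPageText(new StringReader(sample), System.out::println);
    }
}
```

For a real dump you would pass a reader over the (decompressed) file instead of the sample string and run your link extraction inside the callback.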

Here is the regex used by the <a href="https://git.wikimedia.org/blob/mediawiki%2Fcore.git/d2674373c55681b5c4f24c7183b34030487403c2/includes%2Fparser%2FParser.php" rel="nofollow">MediaWiki parser</a>:

    $tc = Title::legalChars() . '#%';
    # Match a link having the form [[namespace:link|alternate]]trail
    $e1 = "/^([{$tc}]+)(?:\\|(.+?))?]](.*)\$/sD";
    # Match cases where there is no "]]", which might still be images
    $e1_img = "/^([{$tc}]+)\\|(.*)\$/sD";

However, this code is applied only after a lot of preprocessing has happened.
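With that caveat in mind, a rough Java approximation that works on raw wikitext might look like the sketch below. This is not the MediaWiki parser's logic: the pattern, the class name PipedLinks, and the short list of skipped prefixes are my own assumptions, and the prefix list is deliberately incomplete. It only accepts links with exactly one pipe, so multi-parameter image links fall through, and it skips targets whose prefix looks like a file, category, or interwiki namespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PipedLinks {

    // [[target|label]] with exactly one pipe and no nested brackets
    private static final Pattern PIPED =
            Pattern.compile("\\[\\[([^\\[\\]|{}<>]+)\\|([^\\[\\]|]+)\\]\\]");

    // Illustrative and incomplete (assumption): extend with the namespaces
    // and language codes that matter for your wiki
    private static final Set<String> SKIPPED_PREFIXES =
            Set.of("file", "image", "category", "fr", "de", "es");

    /** Returns the targets of all piped links whose label equals the given surface form. */
    static List<String> targetsFor(String wikitext, String surfaceForm) {
        List<String> targets = new ArrayList<>();
        Matcher m = PIPED.matcher(wikitext);
        while (m.find()) {
            String target = m.group(1).trim();
            String label = m.group(2).trim();
            int colon = target.indexOf(':');
            String prefix = colon > 0
                    ? target.substring(0, colon).trim().toLowerCase(Locale.ROOT)
                    : "";
            if (SKIPPED_PREFIXES.contains(prefix)) {
                continue; // file, category, or interwiki link
            }
            if (label.equals(surfaceForm)) {
                targets.add(target);
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        String text = "The [[America|US]] and [[Image:MyImage.jpeg|thumb|left|A caption!]] "
                + "plus [[United States|US]] and [[fr:Paris|Paris]].";
        System.out.println(targetsFor(text, "US")); // prints [America, United States]
    }
}
```

Links with an empty label (the "pipe trick", e.g. [[Foo (bar)|]]) and links produced by templates are not matched by this sketch.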

<strong>Old answer</strong>

Using an XML dump will not give you all links, as many links are produced by <a href="https://www.mediawiki.org/wiki/Help:Templates" rel="nofollow">templates</a>, or in some cases even <a href="https://www.mediawiki.org/wiki/Extension:ParserFunctions" rel="nofollow">parser functions</a>. A simpler way would be to use the <a href="https://www.mediawiki.org/wiki/API:Main_page" rel="nofollow">API</a> and query prop=links for the page you are interested in.


I am assuming English Wikipedia here, but it will work anywhere; just substitute your language code for en. in the URL. The redirects directive makes sure that redirects are followed. In the same way, use prop=extlinks to get external links.
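As a small sketch of how such a query URL could be assembled in Java: the class name LinkQuery and the method buildUrl are hypothetical, and format=json is my own addition so that the response is easy to parse. The sketch only builds the URL; fetching it is left to whatever HTTP client you prefer.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class LinkQuery {

    /**
     * Builds a query URL for one page.
     * lang  - language code of the Wikipedia edition, e.g. "en"
     * prop  - "links" for internal links, "extlinks" for external links
     * title - page title, e.g. "Stack_Overflow"
     */
    static String buildUrl(String lang, String prop, String title) {
        return "https://" + lang + ".wikipedia.org/w/api.php"
                + "?action=query&format=json&redirects=1"
                + "&prop=" + URLEncoder.encode(prop, StandardCharsets.UTF_8)
                + "&titles=" + URLEncoder.encode(title, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("en", "links", "Stack_Overflow"));
        System.out.println(buildUrl("en", "extlinks", "Stack_Overflow"));
    }
}
```

Keep in mind that this returns link targets only, not the text each link is displayed with.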


You can grab links for multiple pages at once, either by separating their names with a pipe character, like this: Stack_Overflow|Chicago, or by using a generator, e.g. <a href="https://www.mediawiki.org/wiki/API:Allpages" rel="nofollow">allpages</a> (to run the query against every single page in the wiki).


The number of results returned by the allpages generator can be raised by setting the gaplimit parameter, e.g. &gaplimit=50 to get <a href="https://en.wikipedia.org/w/api.php?action=query&generator=allpages&gaplimit=50&prop=links" rel="nofollow">all links for the first 50 pages</a>. If you request <a href="https://en.wikipedia.org/wiki/Wikipedia:Bots" rel="nofollow">bot status</a> at the Wikipedia edition you are looking at, you can get as many as 5000 results per request; otherwise the maximum is 500 for most (probably all) Wikipedias.
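The two batch variants can be sketched in Java as follows. The class and method names are hypothetical, format=json is again my own addition, and the range check simply mirrors the upper limit mentioned above rather than anything the API client requires.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class BatchLinkQuery {

    private static final String ENDPOINT = "https://en.wikipedia.org/w/api.php";

    /** Query links for several pages at once by joining the titles with a pipe. */
    static String forTitles(List<String> titles) {
        String joined = String.join("|", titles);
        // URLEncoder turns the pipe into %7C, which the server decodes again
        return ENDPOINT + "?action=query&format=json&prop=links&titles="
                + URLEncoder.encode(joined, StandardCharsets.UTF_8);
    }

    /** Query links for pages produced by the allpages generator. */
    static String forAllPages(int gapLimit) {
        if (gapLimit < 1 || gapLimit > 5000) {
            throw new IllegalArgumentException("gaplimit out of range: " + gapLimit);
        }
        return ENDPOINT + "?action=query&format=json&generator=allpages"
                + "&gaplimit=" + gapLimit + "&prop=links";
    }

    public static void main(String[] args) {
        System.out.println(forTitles(List.of("Stack_Overflow", "Chicago")));
        System.out.println(forAllPages(50));
    }
}
```

A gaplimit above 500 will only be honoured for accounts with the higher limits, so a plain client should stay at or below 500.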

