Matching subdomain and top domain using regex in Java


Follow-up to this question: <a href="https://stackoverflow.com/questions/12393918/regex-to-match-pattern-with-subdomain-in-java" rel="nofollow">Regex to match pattern with subdomain in java</a>

I use the pattern below to match the domain and its subdomains:

Pattern pattern = Pattern.compile("http://([a-z0-9]*.)example.com");

This pattern matches the following:

<ul><li>http://asd.example.com</li> <li>http://example.example.com</li> <li>http://www.example.com</li> </ul>

but it does not match

<ul><li>http://example.com</li> </ul>

Can anyone tell me how to match http://example.com too?


Just make the first part optional by putting a ? after the group:

Pattern pattern = Pattern.compile("http://([a-z0-9]*\\.)?example\\.com");

Note that an unescaped . matches any character; use \\. (in a Java string literal) to match a literal dot.
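A minimal sketch of how the corrected pattern behaves, using the inputs from the question. The class name SubdomainMatcher and the helper isExampleDomain are made up for illustration; only the pattern itself comes from the answer above.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name is not from the original post.
class SubdomainMatcher {
    // Optional subdomain label plus a literal dot, then the fixed domain.
    private static final Pattern PATTERN =
            Pattern.compile("http://([a-z0-9]*\\.)?example\\.com");

    // matches() succeeds only if the whole input matches the pattern.
    static boolean isExampleDomain(String url) {
        return PATTERN.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(isExampleDomain("http://example.com"));     // true
        System.out.println(isExampleDomain("http://www.example.com")); // true
        // The escaped dot no longer accepts an arbitrary character.
        System.out.println(isExampleDomain("http://www.examplexcom")); // false
    }
}
```

The last input would have been accepted by the original pattern, because its unescaped . before com matches the x.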


You can use this regex pattern to get the domain of any URL:

\p{L}{0,10}(?:://)?[\p{L}\.]{1,50}


For example:

Input = http://www.google.com/search?q=a
Output = http://www.google.com

Input = ftp://www.google.com/search?q=a
Output = ftp://www.google.com

Input = www.google.com/search?q=a
Output = www.google.com

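Here, \p{L}{0,10} stands for the scheme (http, https, ftp, and possibly others I don't know of), (?:://)? stands for the :// part if it appears, and [\p{L}\.]{1,50} stands for the foo.bar.foo.com part. The rest of the URL is cut off. Because \p{L} matches only letters, a host name containing digits or hyphens is cut at the first such character.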

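And here is the Java code that accomplishes the job:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // The class name is only a wrapper so that the snippet compiles on its own.
    public class DomainExtractor {

        public static final String DOMAIN_PATTERN = "\\p{L}{0,10}(?:://)?[\\p{L}\\.]{1,50}";

        // Returns the first match of DOMAIN_PATTERN in url, or "" if there is none.
        public static String getDomain(String url) {
            if (url == null || url.equals("")) {
                return "";
            }
            Pattern p = Pattern.compile(DOMAIN_PATTERN);
            Matcher m = p.matcher(url);
            if (m.find()) {
                return m.group();
            }
            return "";
        }

        public static void main(String[] args) {
            System.out.println(getDomain("www.google.com/search?q=a"));
        }
    }

Output = www.google.com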

Finally, if you only want to match domains ending in "example.com", you can simply append it (as example\.com, with the dot escaped) to the end of the pattern.


This will match all of the domains ending in "example.com":

Input = http://www.foo.bar.example.com/search?q=a
Output = http://www.foo.bar.example.com
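As a rough sketch, appending the domain to the pattern could look like the following. The exact pattern string is an assumption (the general pattern with example\.com added at the end), and the names ExampleDomainExtractor and getExampleDomain are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are not from the original post.
class ExampleDomainExtractor {
    // Assumption: the general domain pattern with "example\\.com" appended.
    static final Pattern EXAMPLE_DOMAIN_PATTERN =
            Pattern.compile("\\p{L}{0,10}(?:://)?[\\p{L}\\.]{1,50}example\\.com");

    // Returns the first match in url, or "" if nothing matches.
    static String getExampleDomain(String url) {
        if (url == null || url.isEmpty()) {
            return "";
        }
        Matcher m = EXAMPLE_DOMAIN_PATTERN.matcher(url);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        // prints http://www.foo.bar.example.com
        System.out.println(getExampleDomain("http://www.foo.bar.example.com/search?q=a"));
        // prints an empty line: there is no example.com in the input
        System.out.println(getExampleDomain("http://www.google.com/search?q=a"));
    }
}
```

With the {1,50} quantifier left as it is, at least one character is required before example.com, so a bare http://example.com does not match this sketch. Changing the quantifier to {0,50} would allow it.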

Note: \p{Ll} (lowercase Unicode letters) can be used instead of \p{L} (all Unicode letters) if the URLs are known to be lowercase. Host names are case-insensitive, though, so \p{L} is the safer choice for arbitrary input.


