How do I remove URLs from text?

I would like help in parsing text in Ruby.


@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3

I would like to eliminate all the hyperlinks, returning plain text.

@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands


foo = "@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3"
r = foo.gsub(/http:\/\/[\w\.:\/]+/, '')
puts r
# @BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands
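The one-liner above only matches http:// links and leaves a trailing space behind. Here is a minimal sketch of a variant, assuming https:// links should be removed too and leftover whitespace trimmed. The names strip_urls and URL_PATTERN are hypothetical, not from the answer above:

```ruby
# Hypothetical pattern: "http://" or "https://" followed by any run of
# non-whitespace characters. Note that punctuation directly attached to a
# link (for example a trailing period) is removed along with it.
URL_PATTERN = %r{https?://\S+}

# Hypothetical helper: drop every match, then collapse repeated spaces
# and trim the ends.
def strip_urls(text)
  text.gsub(URL_PATTERN, '').squeeze(' ').strip
end

tweet = '@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3'
puts strip_urls(tweet)
```

This is a deliberately loose match. It is probably good enough for tweets, but it is not a validating URL parser.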


This is an old, but good, question. Here's an answer that relies on Ruby's built-in URI:

require 'set'
require 'uri'

text = '@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3'

schemes_regex = /^(?:#{ URI.scheme_list.keys.join('|') })/i

URI.extract(text).each do |url|
  text.gsub!(url, '') if (url[schemes_regex])
end

puts text.squeeze(' ')

And a pass through IRB showing what's happening and the resulting output:

I defined the text to search:

irb(main):004:0* text = '@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3'
=> "@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3"

I defined a regex of the URI schemes we want to react to. This is a defensive move because URI.extract returns a false positive in its search step:

irb(main):006:0* schemes_regex = /^(?:#{ URI.scheme_list.keys.join('|') })/i
=> /^(?:FTP|HTTP|HTTPS|LDAP|LDAPS|MAILTO)/i

Let URI walk through the text finding URLs. For each one found, if it's a scheme we want to react to, strip all its occurrences from the text:

irb(main):008:0* URI.extract(text).each do |url|
irb(main):009:1*   text.gsub!(url, '') if (url[schemes_regex])
irb(main):010:1> end

These are the URLs URI.extract found. It erroneously reports BreakingNews: because the trailing colon makes the word look like a URI scheme. The extraction isn't very sophisticated, but for normal use it's fine:

=> ["BreakingNews:", "http://news.bnonews.com/u4z3"]

Show what the resulting text was:

irb(main):012:0* puts text.squeeze(' ')
@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands
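One way to sidestep the BreakingNews: false positive is to pass the wanted schemes to URI.extract through its optional second argument, so no separate scheme check is needed. This is only a sketch, assuming just http and https links matter. Note that URI.extract is marked obsolete in recent Ruby versions and prints a warning when $VERBOSE is set:

```ruby
require 'uri'

text = '@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3'

# Restrict extraction to http/https so words ending in a colon are ignored.
urls = URI.extract(text, %w[http https])

# Remove each extracted URL, then tidy up the leftover whitespace.
# A new string is built here instead of mutating text in place.
cleaned = urls.reduce(text) { |acc, url| acc.gsub(url, '') }.squeeze(' ').strip

puts cleaned
```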


It can be done in a quick and dirty way or in a sophisticated way. I am showing the sophisticated way:

require 'rubygems'
require 'hpricot' # you may need to install this gem
require 'open-uri'

## first get the URL of the embedded/framed HTML file
start_url = 'http://news.bnonews.com/u4z3'
doc = Hpricot(open(start_url))
news_html_url = doc.at('//link[@href]').to_s.match(/(http[^"]+)/)

## now get the news text; it's in the 3rd <p> tag of the framed HTML file
doc2 = Hpricot(open(news_html_url.to_s))
news_text = doc2.at('//p[3]').to_plain_text
puts news_text

