
How do I remove URLs from text?

I would like help in parsing text in Ruby.

Given:

@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3

I would like to eliminate all the hyperlinks, returning plain text.

@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands

Answer1:

foo = "@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3"
r = foo.gsub(/http:\/\/[\w\.:\/]+/, '')
puts r
# @BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands
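The pattern above only matches http:// links made of word characters, dots, colons and slashes, so an https link or a query string would be left behind in whole or in part. A minimal sketch of a slightly broader variant follows; the URL_PATTERN constant and the strip_urls helper are hypothetical names introduced here, and the sketch assumes that a URL ends at the next whitespace character:

```ruby
# Hypothetical pattern: http or https, then everything up to the next whitespace.
URL_PATTERN = %r{\bhttps?://\S+}i

# Hypothetical helper: remove the links, then tidy up the leftover spaces.
def strip_urls(text)
  text.gsub(URL_PATTERN, '').squeeze(' ').strip
end

tweet = '@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3'
puts strip_urls(tweet)
# @BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands
```

Because the pattern stops only at whitespace, punctuation that directly follows a link (a closing parenthesis or a full stop) is removed along with it, which may or may not be what you want.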

Answer2:

This is an old, but good, question. Here's an answer that relies on Ruby's built-in URI:

require 'set'
require 'uri'

text = '@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3'

schemes_regex = /^(?:#{ URI.scheme_list.keys.join('|') })/i

URI.extract(text).each do |url|
  text.gsub!(url, '') if (url[schemes_regex])
end

puts text.squeeze(' ')

And a pass through IRB showing what's happening and the resulting output:

I defined the text to search:

irb(main):004:0* text = '@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3'
=> "@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3"

I defined a regex of URI schemes we want to react to. This is a defensive move because URI returns a false-positive in its search step:

irb(main):006:0* schemes_regex = /^(?:#{ URI.scheme_list.keys.join('|') })/i
=> /^(?:FTP|HTTP|HTTPS|LDAP|LDAPS|MAILTO)/i

Let URI walk through the text finding URLs. For each one found, if it's a scheme we want to react to, strip all its occurrences from the text:

irb(main):008:0* URI.extract(text).each do |url|
irb(main):009:1*   text.gsub!(url, '') if (url[schemes_regex])
irb(main):010:1> end

These are the URLs URI.extract found. It erroneously reports BreakingNews: because of the trailing colon. The extraction is not very sophisticated, but it is fine for normal use:

=> ["BreakingNews:", "http://news.bnonews.com/u4z3"]

Show what the resulting text was:

irb(main):012:0* puts text.squeeze(' ')
@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands
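As a possible variation on the steps above, URI.extract also accepts a list of schemes as its second argument, which should keep the false positive out of the results without a separate regex. This is only a sketch, restricted to http and https as an assumption about which links matter. Note that current Ruby versions mark URI.extract as obsolete and print a warning when run in verbose mode:

```ruby
require 'uri'

text = '@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3'

# Restricting the schemes means "BreakingNews:" is no longer reported as a URL.
urls = URI.extract(text, %w[http https])

# Remove every occurrence of each URL found, then collapse the leftover spaces.
cleaned = urls.reduce(text) { |acc, url| acc.gsub(url, '') }.squeeze(' ').strip

puts cleaned
```

Unlike the gsub! version, this leaves the original text untouched and builds a new string.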

Answer3:

It can be done in a quick and dirty way or in a sophisticated way. I am showing the sophisticated way:

require 'rubygems'
require 'hpricot' # you may need to install this gem
require 'open-uri'

## first, get the URL of the embedded/framed HTML file
start_url = 'http://news.bnonews.com/u4z3'
doc = Hpricot(open(start_url))
news_html_url = doc.at('//link[@href]').to_s.match(/(http[^"]+)/)

## now get the news text; it's in the 3rd <p> tag of the framed HTML file
doc2 = Hpricot(open(news_html_url.to_s))
news_text = doc2.at('//p[3]').to_plain_text

puts news_text

Try to understand what the code is doing at each step, and apply that knowledge in your future projects. These pages may help:

http://wiki.github.com/why/hpricot/an-hpricot-showcase

http://code.whytheluckystiff.net/doc/hpricot/
