
Ruby Mechanize screen scraping help

Question:

I am trying to scrape one row of a table in which each row starts with a date. I want to scrape only the third row, which has today's date.

This is my Mechanize code. I am trying to select the row that has today's date, along with its columns:

agent.page.search("//td").map(&:text).map(&:strip)

Output: "11-02-2011", "1", "1", "1", "1", "0", "0,00 DKK", "0,00", "0,00 DKK", "12-02-2011", "5", "5", "1", "4", "0", "0,00 DKK", "0,00", "0,00 DKK", "14-02-2011", "1", "3", "1", "1", "0", "0,00 DKK", ",00", "0,00 DKK", "7", "9", "3", "6", "0", "0,00 DKK", "0,00", "0,00 DKK"

I want to scrape only the third row, the one with today's date.

Answer1:

Rather than loop over every <td> tag in the page using '//td', search for the <tr> tags, grab only the third one, then loop over the 'td' cells inside that row.

Mechanize uses Nokogiri internally, so here's how to do it in Nokogiri-ese:

require 'nokogiri'
require 'pp'

html = <<EOT
<table>
<tr><td>11-02-2011</td><td>1</td><td>1</td><td>1</td><td>1</td><td>0</td><td>0,00 DKK</td><td>0,00</td><td>0,00 DKK</td></tr>
<tr><td>12-02-2011</td><td>5</td><td>5</td><td>1</td><td>4</td><td>0</td><td>0,00 DKK</td><td>0,00</td><td>0,00 DKK</td></tr>
<tr><td>14-02-2011</td><td>1</td><td>3</td><td>1</td><td>1</td><td>0</td><td>0,00 DKK</td><td>,00</td><td>0,00 DKK</td></tr>
</table>
EOT

doc = Nokogiri::HTML(html)
pp doc.search('//tr')[2].search('td').map{ |n| n.text }
# => ["14-02-2011", "1", "3", "1", "1", "0", "0,00 DKK", ",00", "0,00 DKK"]

Append .search('//tr')[2].search('td').map{ |n| n.text } to Mechanize's agent.page, like so:

agent.page.search('//tr')[2].search('td').map{ |n| n.text }

It's been a while since I played with Mechanize, so it might also be agent.page.parser....


EDIT:

<blockquote>

More rows will be added to the table. The row that I want to scrape is always the second to last.

</blockquote>

It's important to put that information into your original question. The more accurate your question, the more accurate our answers.
