
Reformatting a scraped Selenium table

Question:

I'm scraping a table (http://www.oddsportal.com/hockey/usa/nhl-2014-2015/results/#/page/2.html) that displays info for a sporting league. So far so good for a Selenium beginner:

    from selenium import webdriver
    import re
    import pandas as pd

    driver = webdriver.PhantomJS(executable_path=r'C:/.../bin/phantomjs.exe')
    driver.get("http://www.oddsportal.com/hockey/usa/nhl-2014-2015/results/#/page/2.html")

    infotable = driver.find_elements_by_class_name("table-main")
    matches = driver.find_elements_by_class_name("table-participant")
    ilist, match = [], []

    for i in infotable:
        ilist.append(i.text)

    infolist = ilist[0]

    for i in matches:
        match.append(i.text)

    driver.close()

    home = pd.Series([item.split(' - ')[0] for item in match])
    away = pd.Series([item.strip().split(' - ')[1] for item in match])

    df = pd.DataFrame({'home' : home, 'away' : away})

    date = re.findall("\d\d\s\w\w\w\s\d\d\d\d", infolist)

In the last line, date collects all the dates in the table, but I can't link them to the corresponding games.

My thinking is: for each child/element "under the date", set date = last_found_date.
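A minimal sketch of this "last found date" idea, assuming the table text has already been split into one line per row and that date header lines begin with a date such as 25 Apr 2015. The attach_dates helper and the sample lines are hypothetical, made up for illustration; only the date pattern comes from the code above.

```python
import re

# Date pattern from the question, e.g. "25 Apr 2015".
DATE_RE = re.compile(r"\d\d\s\w\w\w\s\d\d\d\d")


def attach_dates(lines):
    """Pair each non-date line with the most recent date seen above it."""
    last_found_date = None
    paired = []
    for line in lines:
        found = DATE_RE.match(line)
        if found:
            # A date header: remember it for the rows that follow.
            last_found_date = found.group(0)
        elif last_found_date is not None:
            # A game row listed under the remembered date.
            paired.append((last_found_date, line))
    return paired


# Hypothetical sample lines, not real scraped output.
sample = [
    "25 Apr 2015 - Play Offs",
    "New York Islanders - Washington Capitals",
    "St.Louis Blues - Minnesota Wild",
    "24 Apr 2015 - Play Offs",
    "Vancouver Canucks - Calgary Flames",
]
result = attach_dates(sample)
```

This only works if the scraped text keeps the rows in document order, which is the concern raised in the question.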

The ultimate goal is to have two more columns in df: one with the date of the match, and another with any text found beside the date, for example 'Play Offs' (I can figure that out myself if I can get the date issue sorted).

Should I be incorporating another program/method to retain the order of the tags/elements in the table?

Answer 1:

You would need to change the way you extract the match information. Instead of extracting the home and away teams separately, loop over the match rows once and extract the teams, the date and the event for each row together:

    from selenium import webdriver
    import pandas as pd

    driver = webdriver.PhantomJS()
    driver.get("http://www.oddsportal.com/hockey/usa/nhl-2014-2015/results/#/page/2.html")

    data = []
    # one iteration per match row
    for match in driver.find_elements_by_css_selector("div#tournamentTable tr.deactivate"):
        home, away = match.find_element_by_class_name("table-participant").text.split(" - ")
        # nearest th with class 'first2' that precedes this row, i.e. the date header above it
        date = match.find_element_by_xpath(".//preceding::th[contains(@class, 'first2')][1]").text

        # the header may read "date - event" or just "date"
        if " - " in date:
            date, event = date.split(" - ")
        else:
            event = "Not specified"

        data.append({
            "home": home.strip(),
            "away": away.strip(),
            "date": date.strip(),
            "event": event.strip()
        })

    driver.close()

    df = pd.DataFrame(data)
    print(df)

Prints:

                             away         date          event                 home
     0    Washington Capitals  25 Apr 2015      Play Offs   New York Islanders
     1         Minnesota Wild  25 Apr 2015      Play Offs       St.Louis Blues
     2        Ottawa Senators  25 Apr 2015      Play Offs   Montreal Canadiens
     3    Pittsburgh Penguins  25 Apr 2015      Play Offs     New York Rangers
     4         Calgary Flames  24 Apr 2015      Play Offs    Vancouver Canucks
     5     Chicago Blackhawks  24 Apr 2015      Play Offs  Nashville Predators
     6    Tampa Bay Lightning  24 Apr 2015      Play Offs    Detroit Red Wings
     7     New York Islanders  24 Apr 2015      Play Offs  Washington Capitals
     8         St.Louis Blues  23 Apr 2015      Play Offs       Minnesota Wild
     9          Anaheim Ducks  23 Apr 2015      Play Offs        Winnipeg Jets
    10     Montreal Canadiens  23 Apr 2015      Play Offs      Ottawa Senators
    11       New York Rangers  23 Apr 2015      Play Offs  Pittsburgh Penguins
    12      Vancouver Canucks  22 Apr 2015      Play Offs       Calgary Flames
    13    Nashville Predators  22 Apr 2015      Play Offs   Chicago Blackhawks
    14    Washington Capitals  22 Apr 2015      Play Offs   New York Islanders
    15    Tampa Bay Lightning  22 Apr 2015      Play Offs    Detroit Red Wings
    16          Anaheim Ducks  21 Apr 2015      Play Offs        Winnipeg Jets
    17         St.Louis Blues  21 Apr 2015      Play Offs       Minnesota Wild
    18       New York Rangers  21 Apr 2015      Play Offs  Pittsburgh Penguins
    19      Vancouver Canucks  20 Apr 2015      Play Offs       Calgary Flames
    20     Montreal Canadiens  20 Apr 2015      Play Offs      Ottawa Senators
    21    Nashville Predators  19 Apr 2015      Play Offs   Chicago Blackhawks
    22    Washington Capitals  19 Apr 2015      Play Offs   New York Islanders
    23          Winnipeg Jets  19 Apr 2015      Play Offs        Anaheim Ducks
    24    Pittsburgh Penguins  19 Apr 2015      Play Offs     New York Rangers
    25         Minnesota Wild  18 Apr 2015      Play Offs       St.Louis Blues
    26      Detroit Red Wings  18 Apr 2015      Play Offs  Tampa Bay Lightning
    27         Calgary Flames  18 Apr 2015      Play Offs    Vancouver Canucks
    28     Chicago Blackhawks  18 Apr 2015      Play Offs  Nashville Predators
    29        Ottawa Senators  18 Apr 2015      Play Offs   Montreal Canadiens
    30     New York Islanders  18 Apr 2015      Play Offs  Washington Capitals
    31          Winnipeg Jets  17 Apr 2015      Play Offs        Anaheim Ducks
    32         Minnesota Wild  17 Apr 2015      Play Offs       St.Louis Blues
    33      Detroit Red Wings  17 Apr 2015      Play Offs  Tampa Bay Lightning
    34    Pittsburgh Penguins  17 Apr 2015      Play Offs     New York Rangers
    35         Calgary Flames  16 Apr 2015      Play Offs    Vancouver Canucks
    36     Chicago Blackhawks  16 Apr 2015      Play Offs  Nashville Predators
    37        Ottawa Senators  16 Apr 2015      Play Offs   Montreal Canadiens
    38     New York Islanders  16 Apr 2015      Play Offs  Washington Capitals
    39        Edmonton Oilers  12 Apr 2015  Not specified    Vancouver Canucks
    40          Anaheim Ducks  12 Apr 2015  Not specified      Arizona Coyotes
    41     Chicago Blackhawks  12 Apr 2015  Not specified   Colorado Avalanche
    42    Nashville Predators  12 Apr 2015  Not specified         Dallas Stars
    43          Boston Bruins  12 Apr 2015  Not specified  Tampa Bay Lightning
    44    Pittsburgh Penguins  12 Apr 2015  Not specified       Buffalo Sabres
    45      Detroit Red Wings  12 Apr 2015  Not specified  Carolina Hurricanes
    46      New Jersey Devils  12 Apr 2015  Not specified     Florida Panthers
    47  Columbus Blue Jackets  12 Apr 2015  Not specified   New York Islanders
    48     Montreal Canadiens  12 Apr 2015  Not specified  Toronto Maple Leafs
    49         Calgary Flames  11 Apr 2015  Not specified        Winnipeg Jets
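The date/event split from the loop can also be isolated and checked without a browser. The following is a sketch, not part of the original answer: split_header and parse_date are hypothetical helper names, str.partition is used in place of split so that a header containing more than one " - " does not raise, and parsing assumes English month abbreviations (the default C locale).

```python
from datetime import datetime


def split_header(text, default_event="Not specified"):
    """Split 'DATE - EVENT' into (date, event); use a default when no event is given."""
    date_text, sep, event = text.partition(" - ")
    if not sep:
        # No " - " in the header, so there is no event text.
        event = default_event
    return date_text.strip(), event.strip()


def parse_date(date_text):
    """Parse a date such as '25 Apr 2015' into a datetime.date."""
    return datetime.strptime(date_text, "%d %b %Y").date()


date_text, event = split_header("25 Apr 2015 - Play Offs")
parsed = parse_date(date_text)
```

Storing parsed dates instead of strings makes the date column sortable and filterable once it is in the DataFrame.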
