Python - Map / Reduce - How do I read JSON specific field in using DISCO count words example

I'm following along with the DISCO example for counting words from a file:

Counting Words as a map/reduce job

I had no trouble getting this working. However, I now want to read a specific field from a text file that contains one JSON string per line.

The file has lines like:

{"favorited": false, "in_reply_to_user_id": 306846931, "contributors": null, "truncated": false, "text": "@CataDuarte8 No! av\u00edseme cuando vaya ah salir para yo salir igual!", "created_at": "Wed Apr 04 20:25:37 +0000 2012", "retweeted": false, "in_reply_to_status_id": 187636960632901632, "coordinates": null, "id": 187637067415683073, "entities": {"user_mentions": [{"indices": [0, 12], "id_str": "306846931", "id": 306846931, "name": "Catalina Ria\u00f1o!\u2661", "screen_name": "CataDuarte8"}], "hashtags": [], "urls": []}, "in_reply_to_status_id_str": "187636960632901632", "id_str": "187637067415683073", "in_reply_to_screen_name": "CataDuarte8", "user": {"follow_request_sent": null, "profile_use_background_image": true, "id": 286402064, "description": "Cada quien RECOJE lo que SIEMBRA (:\r\n\u2551\u258c\u2502\u2551\u2502\u2551\u258c\u2502\u2588\u2551\u2502\u2551\u258c\u2502\u2551\u258c\u2551 ", "verified": false, "profile_image_url_https": "https://si0.twimg.com/profile_images/1858805061/ginri_normal.jpg", "profile_sidebar_fill_color": "525252", "is_translator": false, "geo_enabled": false, "profile_text_color": "ffffff", "followers_count": 620, "protected": false, "location": "", "default_profile_image": false, "id_str": "286402064", "utc_offset": -21600, "statuses_count": 16395, "profile_background_color": "000000", "friends_count": 537, "profile_link_color": "ff0000", "profile_image_url": "http://a0.twimg.com/profile_images/1858805061/ginri_normal.jpg", "notifications": null, "show_all_inline_media": true, "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/419254765/Scan0004.jpg", "profile_background_image_url": "http://a0.twimg.com/profile_background_images/419254765/Scan0004.jpg", "screen_name": "LadyRomeroo", "lang": "es", "profile_background_tile": true, "favourites_count": 136, "name": "Lady Romero \u2605", "url": "http://www.facebook.com/profile.php?id=1640385164", "created_at": "Fri Apr 22 23:04:41 +0000 2011", 
"contributors_enabled": false, "time_zone": "Central Time (US & Canada)", "profile_sidebar_border_color": "0a5b80", "default_profile": false, "following": null, "listed_count": 0}, "place": null, "retweet_count": 0, "geo": null, "in_reply_to_user_id_str": "306846931", "source": "web"}

I'm only interested in the value of the "text" key. In plain Python I can do:

    import simplejson

    f = open("file.json", "r")
    for line in f:
        r = simplejson.loads(line).get('text')
        print r

which returns all the text field values like:

@_MuitoMais_ ´vcs são d msm amei o pode ou ão pode e a entrevist com a @claudialeitte =)

This works fine. However, when I apply the same method to the sample count_words.py example that comes with Disco, like so:

    from disco.core import Job, result_iterator
    import simplejson

    def map(line, params):
        r = simplejson.loads(line).get('text')
        for word in r.split():
            yield word, 1

    def reduce(iter, params):
        from disco.util import kvgroup
        for word, counts in kvgroup(sorted(iter)):
            yield word, sum(counts)

    if __name__ == '__main__':
        job = Job().run(input=["/tmp/file.json"], map=map, reduce=reduce)
        for word, count in result_iterator(job.wait(show=True)):
            print word, count

I get the following error:

    # python test.py
    Job@549:b4c76:9cbb1: Status: [map] 0 waiting, 1 running, 0 done, 0 failed
    2012/11/24 02:01:10 master New job initialized!
    2012/11/24 02:01:10 master Starting job
    2012/11/24 02:01:10 master Starting map phase
    2012/11/24 02:01:10 master map:0 assigned to comp1
    2012/11/24 02:01:11 master ERROR: Job failed: Worker at 'comp1' died: Traceback (most recent call last):
      File "/home/DISCO/data/comp1/46/Job@549:b4c76:9cbb1/usr/local/lib/python2.7/site-packages/disco/worker/__init__.py", line 329, in main
        job.worker.start(task, job, **jobargs)
      File "/home/DISCO/data/comp1/46/Job@549:b4c76:9cbb1/usr/local/lib/python2.7/site-packages/disco/worker/__init__.py", line 290, in start
        self.run(task, job, **jobargs)
      File "/home/DISCO/data/comp1/46/Job@549:b4c76:9cbb1/usr/local/lib/python2.7/site-packages/disco/worker/classic/worker.py", line 286, in run
        getattr(self, task.mode)(task, params)
      File "/home/DISCO/data/comp1/46/Job@549:b4c76:9cbb1/usr/local/lib/python2.7/site-packages/disco/worker/classic/worker.py", line 302, in map
        part = str(self['partition'](key, self['partitions'], params))
      File "/home/DISCO/data/comp1/46/Job@549:b4c76:9cbb1/usr/local/lib/python2.7/site-packages/disco/worker/classic/func.py", line 341, in default_partition
        return hash(str(key)) % nr_partitions
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xb4' in position 0: ordinal not in range(128)
    2012/11/24 02:01:11 master WARN: Job killed
    Status: [map] 1 waiting, 0 running, 0 done, 1 failed
    Traceback (most recent call last):
      File "test.py", line 18, in <module>
        for word, count in result_iterator(job.wait(show=True)):
      File "/usr/local/lib/python2.7/site-packages/disco/core.py", line 348, in wait
        timeout, poll_interval * 1000)
      File "/usr/local/lib/python2.7/site-packages/disco/core.py", line 309, in check_results
        raise JobError(Job(name=jobname, master=self), "Status %s" % status)
    disco.error.JobError: Job Job@549:b4c76:9cbb1 failed: Status dead

It seems like this should be straightforward, but I'm obviously missing something.

Can anyone help?

Answer1:

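Your problem is in disco/worker/classic/func.py: default_partition calls str(key) on every key your map() yields. In Python 2, calling str() on a unicode object implicitly encodes it with the ASCII codec, which fails as soon as the word contains a non-ASCII character:

    >>> str(u'\xb4')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xb4' in position 0: ordinal not in range(128)
    >>>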

Since you are only counting words, you could convert your unicode data into ASCII byte strings. Normalize the text with the unicodedata module so that accented letters decompose into a base letter plus a combining mark, then encode to ASCII and ignore whatever does not fit:

    import json
    import unicodedata

    f = open('file.json')
    for line in f:
        r = json.loads(line).get('text')
        s = unicodedata.normalize('NFD', r).encode('ascii', 'ignore')
        print r
        print s

Output:

    @CataDuarte8 No! avíseme cuando vaya ah salir para yo salir igual!
    @CataDuarte8 No! aviseme cuando vaya ah salir para yo salir igual!
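
To see what this normalization keeps and what it discards, here is a small sketch. The helper name `to_ascii` is made up for illustration; the sample strings are taken from the tweets in the question.

```python
import unicodedata

def to_ascii(text):
    # Hypothetical helper: NFD splits an accented letter into a base letter
    # plus a combining mark; encoding with 'ignore' then drops every
    # code point that is not ASCII.
    return unicodedata.normalize('NFD', text).encode('ascii', 'ignore')

accented = to_ascii(u'av\u00edseme')  # the accent is removed, the base letter stays
spacing = to_ascii(u'\u00b4vcs')      # U+00B4 has no canonical decomposition, so it is dropped
symbol = to_ascii(u'Romero \u2605')   # the star is dropped, the space before it stays
```

Words that differ only by accents are therefore counted together, and a token made up entirely of non-ASCII symbols disappears from the count.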

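Applying this to your problem, rewrite your map() function as:

    import unicodedata

    def map(line, params):
        r = simplejson.loads(line).get('text')
        # Decompose accented characters, then drop anything non-ASCII so that
        # default_partition can call str() on each key.
        s = unicodedata.normalize('NFD', r).encode('ascii', 'ignore')
        for word in s.split():
            yield word, 1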
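
If stripping accents is too lossy, one alternative (not part of the answer above, and not tested against Disco) is to encode each word as UTF-8 instead. In Python 2, str() of a byte string returns it unchanged, so the `hash(str(key))` call shown in the traceback should no longer raise. The name `map_utf8` and the guard for records without a "text" field are assumptions added for illustration, and the sketch uses the standard-library json module in place of simplejson.

```python
import json

def map_utf8(line, params):
    # Hypothetical variant of map(): keeps accented words distinct by
    # encoding each word as UTF-8 bytes instead of reducing it to ASCII.
    text = json.loads(line).get('text')
    if not text:
        # Assumption: records with a missing or null "text" field are skipped.
        return
    for word in text.split():
        yield word.encode('utf-8'), 1

# Quick local check without a Disco cluster.
sample = '{"text": "No! av\\u00edseme cuando"}'
pairs = list(map_utf8(sample, None))
no_text = list(map_utf8('{"id": 1}', None))
```

The keys are then byte strings, so any code that prints the results would need to decode them from UTF-8 to display the original characters.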
