
Write 100 million files to S3

My main aim is to split records into files according to each record's ID. There are over 15 billion records right now, and that number will certainly grow. I need a scalable solution using Amazon EMR. I have already done this for a smaller dataset of around 900 million records.

The input files are in CSV format, and one of the fields must become the file name in the output. Say the input contains the following records:

    awesomeID1, somedetail1, somedetail2
    awesomeID1, somedetail3, somedetail4
    awesomeID2, somedetail5, somedetail6

The output should then be two files, awesomeID1.dat and awesomeID2.dat, each containing the records for its ID.
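To make the expected result concrete, here is a minimal in-memory sketch of the grouping for the three sample records. It is not the Hadoop job itself; the function name `split_by_id` is made up for illustration, and it assumes the ID is always the first CSV field.

```python
import csv
import io
from collections import defaultdict


def split_by_id(csv_text):
    """Group CSV rows by their first field and return {file name: [lines]}."""
    groups = defaultdict(list)
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        if not row:
            continue  # skip blank lines
        record_id = row[0].strip()  # assumption: the ID is the first field
        line = ", ".join(field.strip() for field in row)
        groups[record_id + ".dat"].append(line)
    return dict(groups)


sample = (
    "awesomeID1, somedetail1, somedetail2\n"
    "awesomeID1, somedetail3, somedetail4\n"
    "awesomeID2, somedetail5, somedetail6\n"
)
files = split_by_id(sample)
for name in sorted(files):
    print(name, len(files[name]))
# prints:
# awesomeID1.dat 2
# awesomeID2.dat 1
```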

Input size: 600 GB per month in total (gzipped), with each file around 2 to 3 GB. I need to process six months or more at a time, so the total data size would be at least 6 × 600 GB = 3.6 TB (compressed).

Previously I was getting a "Too many open files" error when I used FileByKeyTextOutputFormat extends MultipleTextOutputFormat<Text, Text> to write to S3 according to the ID value. Then, as I have explained here, instead of writing every file directly to S3, I wrote them locally and moved them to S3 in batches of 1024 files.
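The batching step described above might look roughly like the following sketch. The names `batched` and `move_in_batches` and the `upload_batch` callback are hypothetical; a real job would copy each batch to S3 and delete the local copies inside that callback.

```python
from itertools import islice


def batched(paths, batch_size=1024):
    """Yield successive lists of at most batch_size items from paths."""
    iterator = iter(paths)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def move_in_batches(paths, upload_batch, batch_size=1024):
    """Call upload_batch once per batch and return the number of batches sent."""
    count = 0
    for batch in batched(paths, batch_size):
        upload_batch(batch)  # hypothetical callback standing in for the S3 copy
        count += 1
    return count


# Demo with 2500 fake file names; the "upload" just records each batch.
uploaded = []
n = move_in_batches((f"id{i}.dat" for i in range(2500)), uploaded.append)
print(n, [len(b) for b in uploaded])
# prints: 3 [1024, 1024, 452]
```

Batching bounds the number of files open at once, but it does not by itself limit the request rate seen by S3.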

But now, with the increased amount of data, I am getting the following message from S3, after which it skips writing the file in question: "Please reduce your request rate." I also have to run on a cluster of 200 m1.xlarge machines, which takes around 2 hours and is therefore very costly.

I would like a scalable solution that will not fail if the amount of data increases again in the future.

Any suggestions?

Answer1:

Here is some info on SlowDown errors: https://forums.aws.amazon.com/message.jspa?messageID=89722#89816 You should insert into S3 in alphabetical order. Also, the limit is dynamic and readjusts over time, so slow down and try to increase your rate later.
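One way to "slow down and try again later" is exponential backoff with jitter around each write, combined with uploading keys in sorted order. The sketch below is only an illustration: `SlowDownError` is a stand-in exception rather than a real SDK class, and `put` is a hypothetical callable that performs a single S3 write.

```python
import random
import time


class SlowDownError(Exception):
    """Stand-in for the 'Please reduce your request rate' response (hypothetical)."""


def put_with_backoff(put, key, max_attempts=6, base_delay=0.5, sleep=time.sleep):
    """Call put(key), retrying with exponential backoff and jitter on SlowDownError."""
    for attempt in range(max_attempts):
        try:
            return put(key)
        except SlowDownError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))


def upload_sorted(keys, put, **kwargs):
    """Upload keys in alphabetical order, retrying each write on throttling."""
    for key in sorted(keys):
        put_with_backoff(put, key, **kwargs)


# Demo: a fake put that is throttled twice for one key, with sleeping disabled.
calls = []
failures = {"b.dat": 2}


def fake_put(key):
    if failures.get(key, 0) > 0:
        failures[key] -= 1
        raise SlowDownError(key)
    calls.append(key)


upload_sorted(["c.dat", "a.dat", "b.dat"], fake_put, sleep=lambda seconds: None)
print(calls)
# prints: ['a.dat', 'b.dat', 'c.dat']
```

With retries in place, a throttled file is delayed rather than skipped.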

Perhaps you are better off using a database than a filesystem? How big is the total dataset?

DynamoDB may be a good fit, but it may be expensive at $1/GB/month, since it uses SSDs for backing storage.

RDS is another option. Its pricing starts at $0.10/GB/month.

Even better may be to host your own NoSQL or other datastore on EC2, such as on the new hs1.8xlarge instance. You can launch it only when you need it, and back it up to S3 when you don't.
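To illustrate what "a database rather than a filesystem" could mean here, the sketch below uses Python's built-in sqlite3 purely as a stand-in for whichever datastore is chosen. The table and column names are assumptions. Instead of one file per ID, all records go into one indexed table, and a lookup by ID returns what the per-ID file would have contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute("CREATE TABLE records (id TEXT NOT NULL, detail TEXT NOT NULL)")
conn.execute("CREATE INDEX idx_records_id ON records (id)")

rows = [
    ("awesomeID1", "somedetail1, somedetail2"),
    ("awesomeID1", "somedetail3, somedetail4"),
    ("awesomeID2", "somedetail5, somedetail6"),
]
conn.executemany("INSERT INTO records (id, detail) VALUES (?, ?)", rows)
conn.commit()


def records_for(connection, record_id):
    """Return all details stored for one ID, in insertion order."""
    cursor = connection.execute(
        "SELECT detail FROM records WHERE id = ? ORDER BY rowid", (record_id,)
    )
    return [row[0] for row in cursor]


print(records_for(conn, "awesomeID1"))
# prints: ['somedetail1, somedetail2', 'somedetail3, somedetail4']
```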
