My main aim is to split out records into files according to the ids of each record, and there are over 15 billion records right now which can certainly increase. I need a scalable solution using Amazon EMR. I have already got this done for a smaller dataset having around 900 million records.
Input files are in csv format, with one of the field which is need to be the file name in the output. So say that there are following input records:
awesomeId1, somedetail1, somedetail2 awesomeID1, somedetail3, somedetail4 awesomeID2, somedetail5, somedetail6
So now 2 files should be as output, one named
awesomeID1.dat and other as
awesomeID2.dat, each having records pertaining to respective IDs.
Size of the input: Total 600 GB (size of gzippef files) per month, each files is around 2 3 GB. And I need to process it for around 6 months or more at a time. so Total data size would be 6*600 GB (compressed).
Previously I was getting
Too many open files error when I was using
FileByKeyTextOutputFormat extends MultipleTextOutputFormat<Text, Text> to write to s3 according to the id value. Then as I have explained here, instead of writing every file directly to s3, I wrote them locally and moved to s3 in batches of 1024 files.
But now with increased amount of data, I am getting following message from s3 and then it skips writing the file in question :
"Please reduce your request rate." Also I am having to run on a cluster with 200 m1.xlarge machines which then take around 2 hours, and hence it is very costly too!
I would like to have a <strong>scalable</strong> solution which shall not fail if amount of data increases again in future.
Here is some info on SlowDown errors: https://forums.aws.amazon.com/message.jspa?messageID=89722#89816 You should insert into S3 in alphabetical order. Also the limit is dynamic and re-adjusts over time, so slow down and try to increase your rate later.
Perhaps you are better off using a database than a filesystem? How big is the total dataset?
DynamoDB may be a good fit, but may be expensive at $1/GB/month. (Since it uses SSD for backing storage.)
RDS is another option. Its pricing is from $0.10/GB/month.
Even better may be to host your own NoSQL or other datastore on EC2, such as on the new hs1.8xlarge instance. You can launch it only when you need it, and back it up to S3 when you don't.