I've been trying for some time to create a query that counts the rows in a table per day that have a certain id in one column, and then groups those counts into weekly values based on the UNIX timestamp column. I have a medium-sized dataset with 37 million rows, and have been trying to run the following kind of query:
SELECT DATE(timestamp), COUNT(*) FROM `table` WHERE ( date(timestamp) between "YYYY-MM-DD" and "YYYY-MM-DD" and column_group_id=X ) group by week(date(startdate))
However, I'm getting weird results: the query doesn't group the counts correctly, and the values in the resulting count column are too large (I verified the errors by querying very small, specific datasets).
If I group by date(startdate) instead, the row counts match on a per-day basis, but I'd like to combine these daily row counts into weekly counts. How could this be done? The data is needed in the format:
2006-01-01 | 5
2006-01-08 | 10
so that the date is the first column and the second is the number of rows per week.
Your query is non-deterministic, so it is not surprising that you are getting unexpected results. By this I mean you could run this query on the same data 5 times and get 5 different result sets. This is because you are selecting DATE(timestamp) but grouping by WEEK(DATE(startdate)); the query therefore returns the timestamp of whichever row it comes across first for each startdate week, in ANY order.
Consider the following 2 rows (with timestamp in date format for ease of reading):
TimeStamp   StartDate
20120601    20120601
20120701    20120601
Your query is grouping by WEEK(StartDate), which is 22 for both rows (in MySQL's default week mode), so you would expect your results to have 1 row with a count of 2. DATE(Timestamp) is also in the select list, and since there is no ORDER BY clause the query has no idea which Timestamp to return, '20120601' or '20120701'. So even on this small result set you have a 50:50 chance of getting:
TimeStamp   COUNT
20120601    2
and a 50:50 chance of getting
TimeStamp   COUNT
20120701    2
If you add more data to the dataset like so:
TimeStamp   StartDate
20120601    20120601
20120701    20120601
20120701    20120701
You could get
TimeStamp   COUNT
20120601    2
20120701    1
or
TimeStamp   COUNT
20120701    2
20120701    1
You can see how with 37,000,000 rows you will soon get results that you do not expect and cannot predict!
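One way to remove the ambiguity while keeping the original grouping is to wrap the selected column in an aggregate, so that each group has exactly one well-defined value. The sketch below assumes the earliest date in each group is an acceptable label; the aliases first_day and row_count are made up for the illustration.

```sql
-- Sketch only: MIN() picks one well-defined value per group, so repeated
-- runs return the same label. Using the earliest date is an assumption.
SELECT MIN(DATE(`timestamp`)) AS first_day,
       COUNT(*)               AS row_count
FROM `table`
GROUP BY WEEK(DATE(startdate));
```

If the ONLY_FULL_GROUP_BY SQL mode is enabled, MySQL rejects the original form of the query with an error instead of picking a value arbitrarily.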
Since it looks like you are trying to get the week start in your results while grouping by week, you could use the following to get the week start (replacing CURRENT_TIMESTAMP with whichever column you want):
SELECT DATE_ADD(CURRENT_TIMESTAMP, INTERVAL 1 - DAYOFWEEK(CURRENT_TIMESTAMP) DAY) AS WeekStart
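As a quick check of this expression against the sample dates used earlier, the sketch below applies it to two literal dates. The derived table name sample_dates and the column alias d are made up for the illustration. In MySQL, DAYOFWEEK returns 1 for Sunday, so the result is the Sunday that starts each week.

```sql
SELECT d AS original_date,
       DATE_ADD(d, INTERVAL 1 - DAYOFWEEK(d) DAY) AS WeekStart
FROM (
    SELECT DATE('2012-06-01') AS d
    UNION ALL
    SELECT DATE('2012-07-01')
) AS sample_dates;
-- 2012-06-01 (a Friday) -> 2012-05-27
-- 2012-07-01 (a Sunday) -> 2012-07-01
```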
You can then group by this date too to get weekly results and avoid the trouble of having things in your select list that aren't in your group by.
SELECT DATE_ADD(DATE(startdate), INTERVAL 1 - DAYOFWEEK(startdate) DAY) AS WeekStart, COUNT(*) FROM `table` WHERE ( date(timestamp) between "YYYY-MM-DD" and "YYYY-MM-DD" and column_group_id=X ) GROUP BY WeekStart