extract basic count per values using the chunksize parameter in pandas

I have a CSV file with the following categories: item1,item2,item3,item4 which values is exactly one of the following: 0,1,2,3,4. I would like to count for each items how many are there for each value. My code is the following, df being the corresponding DataFrame:

outputDf = pandas.DataFrame() cat_list = list(df.columns.values) for col in cat_list: s = df.groupby(col).size() outputDf[col] = s

I would like to do exactly the same using the chunksize parameter when I read my CSV with read_csv, because my CSV is very big. My problem is: I can't find a way to find the cat_list, neither to build the outputDf.

Can someone give me a hint?

Answer1:

I'd apply value_counts columnwise rather than doing groupby:

>>> df = pd.read_csv("basic.csv", usecols=["item1", "item2", "item3", "item4"]) >>> df.apply(pd.value_counts) item1 item2 item3 item4 0 17 26 17 20 1 21 21 22 19 2 17 18 22 23 3 24 14 20 24 4 21 21 19 14

And for the chunked version, we just need to assemble the parts (making sure to fillna(0) so that if a part doesn't have a 3, for example, we get 0 and not nan.)

>>> df_iter = pd.read_csv("basic.csv", usecols=["item1", "item2", "item3", "item4"], chunksize=10) >>> sum(c.apply(pd.value_counts).fillna(0) for c in df_iter) item1 item2 item3 item4 0 17 26 17 20 1 21 21 22 19 2 17 18 22 23 3 24 14 20 24 4 21 21 19 14

(Of course, in practice you'd probably want to use as large a chunksize as you can get away with.)

人吐槽 人点赞

Recommend

Comment

用户名: 密码:
验证码: 匿名发表

你可以使用这些语言

查看评论:extract basic count per values using the chunksize parameter in pandas