
Extract basic counts per value using the chunksize parameter in pandas

I have a CSV file with the following columns: item1, item2, item3, item4. Each value is exactly one of 0, 1, 2, 3, 4. For each item, I would like to count how many times each value occurs. My code is the following, where df is the corresponding DataFrame:

outputDf = pandas.DataFrame()
cat_list = list(df.columns.values)
for col in cat_list:
    s = df.groupby(col).size()
    outputDf[col] = s

I would like to do exactly the same using the chunksize parameter of read_csv, because my CSV is very big. My problem is that I can't find a way to get cat_list or to build outputDf.

Can someone give me a hint?

Answer1:

I'd apply value_counts columnwise rather than doing groupby:

>>> import pandas as pd
>>> df = pd.read_csv("basic.csv", usecols=["item1", "item2", "item3", "item4"])
>>> df.apply(pd.value_counts)
   item1  item2  item3  item4
0     17     26     17     20
1     21     21     22     19
2     17     18     22     23
3     24     14     20     24
4     21     21     19     14

And for the chunked version, we just need to assemble the parts (making sure to fillna(0) so that if a part doesn't have a 3, for example, we get 0 and not NaN).

>>> df_iter = pd.read_csv("basic.csv", usecols=["item1", "item2", "item3", "item4"], chunksize=10)
>>> sum(c.apply(pd.value_counts).fillna(0) for c in df_iter)
   item1  item2  item3  item4
0     17     26     17     20
1     21     21     22     19
2     17     18     22     23
3     24     14     20     24
4     21     21     19     14
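One caveat with summing the parts using plain +: if a value is missing from every column of a chunk, its row is absent from that chunk's counts, and index alignment would turn that row into NaN in the total. Below is a minimal sketch of a variant that accumulates with DataFrame.add(fill_value=0) instead. The function name count_values_chunked and the small in-memory CSV are assumptions for illustration only, and the sketch assumes the file has at least one data row. It calls the Series.value_counts method rather than the top-level pd.value_counts, which recent pandas releases deprecate.

```python
import io

import pandas as pd


def count_values_chunked(source, columns, chunksize):
    """Count how often each value occurs in each column, reading in chunks."""
    total = None
    for chunk in pd.read_csv(source, usecols=columns, chunksize=chunksize):
        # Per-column counts for this chunk; a value absent from a column is NaN.
        counts = chunk.apply(lambda s: s.value_counts())
        # add(fill_value=0) treats a label missing on one side as 0
        # instead of propagating NaN through the running total.
        total = counts if total is None else total.add(counts, fill_value=0)
    # Anything still NaN was never seen in that column.
    return total.fillna(0).astype(int).sort_index()


# Hypothetical sample data, small enough that some chunks lack some values.
csv_text = "item1,item2\n0,1\n0,1\n2,1\n3,3\n4,4\n"
result = count_values_chunked(io.StringIO(csv_text), ["item1", "item2"], chunksize=2)
print(result)
```

For a real file, the first argument would be the path (for example "basic.csv") rather than a StringIO object.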

(Of course, in practice you'd probably want to use as large a chunksize as you can get away with.)
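If the column names are not known up front (the cat_list from the question), one option is to read only the header line before iterating over chunks. This is a sketch under the assumption that the first line of the CSV is a header row; the in-memory sample text stands in for the real file.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the large CSV file.
csv_text = "item1,item2,item3,item4\n0,1,2,3\n4,3,2,1\n"

# nrows=0 parses the header only, so no data rows are loaded into memory.
cat_list = pd.read_csv(io.StringIO(csv_text), nrows=0).columns.tolist()
print(cat_list)
```

The resulting list can then be passed as usecols when reading the file with chunksize.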
