Using Pandas read_csv() on an open file twice

As I was experimenting with pandas, I noticed some odd behavior of pandas.read_csv and was wondering if someone with more experience could explain what might be causing it.

To start, here is my basic class definition for creating a new pandas DataFrame from a .csv file:

    import pandas as pd

    class dataMatrix:
        def __init__(self, filepath):
            self.path = filepath  # File path to the target .csv file.
            self.csvfile = open(filepath)  # Open file.
            self.csvdataframe = pd.read_csv(self.csvfile)

Now, this works pretty well, and calling the class in my __main__.py successfully creates a pandas DataFrame:

    from dataMatrix import dataMatrix

    testObject = dataMatrix('/path/to/csv/file')

But I noticed that this process automatically used the first row of the .csv as the DataFrame's columns index. Instead, I decided to number the columns. Since I didn't want to assume I knew the number of columns beforehand, I took the approach of opening the file, loading it into a DataFrame, counting the columns, and then reloading the DataFrame with the proper number of columns using range().

    import pandas as pd

    class dataMatrix:
        def __init__(self, filepath):
            self.path = filepath
            self.csvfile = open(filepath)

            # Load the .csv file to count the columns.
            self.csvdataframe = pd.read_csv(self.csvfile)
            # Count the columns.
            self.numcolumns = len(self.csvdataframe.columns)
            # Re-load the .csv file, manually setting the column names to their
            # number.
            self.csvdataframe = pd.read_csv(self.csvfile,
                                            names=range(self.numcolumns))

Keeping my processing in __main__.py the same, I got back a DataFrame with the correct number of columns (500 in this case) and the proper names (0...499), but it was otherwise empty (no row data).

Scratching my head, I decided to close self.csvfile and reload it like so:

    import pandas as pd

    class dataMatrix:
        def __init__(self, filepath):
            self.path = filepath
            self.csvfile = open(filepath)

            # Load the .csv file to count the columns.
            self.csvdataframe = pd.read_csv(self.csvfile)
            # Count the columns.
            self.numcolumns = len(self.csvdataframe.columns)

            # Close the .csv file.           #<---- +++++++
            self.csvfile.close()             #<---- Added
            # Re-open file.                  #<---- Block
            self.csvfile = open(filepath)    #<---- +++++++

            # Re-load the .csv file, manually setting the column names to their
            # number.
            self.csvdataframe = pd.read_csv(self.csvfile,
                                            names=range(self.numcolumns))

After closing the file and re-opening it, I correctly got back a DataFrame with columns numbered 0...499 and all 255 subsequent rows of data.

My question is why does closing the file and re-opening it make a difference?

Answer1:

When you open a file with

open(filepath)

a file object is returned. It behaves like an iterator: it keeps track of its current position and is good for one pass through its contents. So

self.csvdataframe = pd.read_csv(self.csvfile)

reads the contents and exhausts the iterator, leaving the file position at the end of the file. Any subsequent call to pd.read_csv on the same handle finds nothing left to read.
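To see the exhaustion directly, here is a minimal sketch. The three-row CSV and the names csv_path, handle and leftover are made up for illustration; they are not from the question.

```python
import os
import tempfile

import pandas as pd

# Write a tiny CSV (hypothetical data) to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("1,2,3\n4,5,6\n7,8,9\n")
    csv_path = tmp.name

with open(csv_path) as handle:
    # The first read consumes the handle up to the end of the file.
    first = pd.read_csv(handle)
    # Nothing is left to read, so this returns an empty string.
    leftover = handle.read()
    # The second read sees no data; only the supplied names survive.
    second = pd.read_csv(handle, names=list(range(len(first.columns))))

os.remove(csv_path)

print(first.shape)     # first row became the header, two data rows remain
print(repr(leftover))
print(second.shape)    # right number of columns, zero rows
```

The second DataFrame has the right number of columns but no rows, which is the behavior described in the question.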

Note that you could avoid this problem by just passing the file path to pd.read_csv:

    import pandas as pd

    class dataMatrix:
        def __init__(self, filepath):
            self.path = filepath

            # Load the .csv file to count the columns.
            self.csvdataframe = pd.read_csv(filepath)
            # Count the columns.
            self.numcolumns = len(self.csvdataframe.columns)
            # Re-load the .csv file, manually setting the column names to their
            # number.
            self.csvdataframe = pd.read_csv(filepath,
                                            names=range(self.numcolumns))

pd.read_csv will then open (and close) the file for you.

PS. Another option is to reset the file handle to the beginning of the file by calling self.csvfile.seek(0), but using pd.read_csv(filepath, ...) is still easier.
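A rough sketch of the seek(0) approach, using io.StringIO as a stand-in for the object returned by open(filepath); the sample data and variable names are assumptions for illustration only:

```python
import io

import pandas as pd

# Stand-in for open(filepath); the contents are made-up sample data.
handle = io.StringIO("1,2,3\n4,5,6\n7,8,9\n")

# First pass: only used to count the columns.
counted = pd.read_csv(handle)
num_columns = len(counted.columns)

# Rewind to the start so the second pass has something to read.
handle.seek(0)
numbered = pd.read_csv(handle, names=list(range(num_columns)))

print(numbered.shape)
print(list(numbered.columns))
```

Because names is supplied, the second read treats the first line as data rather than as a header, so all three rows come back.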

<hr>

Even better, instead of calling pd.read_csv twice (which is inefficient), you could rename the columns like this:

    class dataMatrix:
        def __init__(self, filepath):
            self.path = filepath

            # Load the .csv file to count the columns.
            self.csvdataframe = pd.read_csv(filepath)
            self.numcolumns = len(self.csvdataframe.columns)
            self.csvdataframe.columns = range(self.numcolumns)
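One caveat worth checking against your own data: renaming the columns after a default read still consumes the first line of the file as the header, so that line does not appear as a row. If the first line is actually data, passing header=None to pd.read_csv numbers the columns in a single read and keeps that line. A small comparison, with made-up sample data and names (CSV_TEXT, renamed, no_header):

```python
import io

import pandas as pd

CSV_TEXT = "1,2,3\n4,5,6\n7,8,9\n"  # hypothetical file contents, no header row

# Default read, then rename: the first line was used up as the header.
renamed = pd.read_csv(io.StringIO(CSV_TEXT))
renamed.columns = range(len(renamed.columns))

# header=None: columns are numbered 0..n-1 and the first line stays as data.
no_header = pd.read_csv(io.StringIO(CSV_TEXT), header=None)

print(renamed.shape)
print(no_header.shape)
print(list(no_header.columns))
```

Which of the two is appropriate depends on whether the first line of the .csv is a real header or a row of data.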
