What are the advantages and disadvantages of reading an entire file into a single String as opposed

Specifically, my end goal is to store every comma separated word from the file in a List<String> and I was wondering which approach I should take.

Approach 1:

    String fileContents = new Scanner(new File("filepath")).useDelimiter("\\Z").next();
    List<String> list = Arrays.asList(fileContents.split("\\s*,\\s*"));

Approach 2:

    Scanner s = new Scanner(new File("filepath")).useDelimiter(",");
    List<String> list = new ArrayList<>();
    while (s.hasNext()) {
        list.add(s.next());
    }
    s.close();

Answer1:

Approach #1 reads the entire file into memory. This has a few performance-related issues:

  • If the file is big, that uses a lot of memory.
  • Because of the way the characters need to be accumulated by the Scanner.next() call, they may need to be copied 2 or even 3 times.
  • There are other inefficiencies because you are using a general pattern-matching engine for a very specific purpose.

Approach #3 (which is Approach #1 with the file reading done better) addresses a lot of the efficiency issues, but you still hold the entire file contents in memory.
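The code for Approach #3 is not shown here. As one possible reading of "Approach #1 with the file reading done better", here is a minimal sketch. It assumes Java 11+ (for `Files.readString` and `String.strip`) and a UTF-8 file; the names `WholeFileSplit`, `splitWords` and `readWords` are hypothetical. It still holds the whole file in memory, so it only addresses the copying overhead, not the memory footprint.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: read the whole file in one call, then split it.
class WholeFileSplit {

    // Splits already-loaded text on commas, ignoring whitespace around each comma.
    static List<String> splitWords(String contents) {
        // Strip leading/trailing whitespace (e.g. a final newline) so it does
        // not end up attached to the first or last word.
        String trimmed = contents.strip();
        if (trimmed.isEmpty()) {
            // "".split(...) would yield one empty string; return no words instead.
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s*,\\s*")));
    }

    // Assumption: the file is UTF-8 encoded and small enough to fit in memory.
    static List<String> readWords(Path file) throws IOException {
        return splitWords(Files.readString(file, StandardCharsets.UTF_8));
    }
}
```

Usage would be something like `WholeFileSplit.readWords(Path.of("filepath"))`.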

Approach #2 is best from a memory usage perspective because you don't hold the entire file contents as a single string or buffer<sup>1</sup>. Performance is also likely to be best because (my intuition says) this approach avoids at least one copy of the characters.
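Note that Approach #2 as written in the question splits on a bare `","`, so unlike Approach #1 it keeps any whitespace around the words, and it leaks the Scanner if an exception is thrown before `close()`. A hedged sketch of the same idea with both points addressed follows. The names `IncrementalSplit`, `readWordsFrom` and `readWordsFromFile` are hypothetical, UTF-8 is an assumption, and dropping blank tokens is a choice made here, not something the question specifies.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper: pull one comma-separated token at a time from the input.
class IncrementalSplit {

    static List<String> readWordsFrom(Readable source) {
        List<String> words = new ArrayList<>();
        // try-with-resources closes the Scanner (and the source, if it is
        // Closeable) even when an exception is thrown mid-read.
        try (Scanner scanner = new Scanner(source).useDelimiter("\\s*,\\s*")) {
            while (scanner.hasNext()) {
                // trim() removes whitespace the delimiter cannot reach, such as
                // leading spaces on the first token or a trailing newline.
                String word = scanner.next().trim();
                if (!word.isEmpty()) { // choice made here: skip blank tokens
                    words.add(word);
                }
            }
        }
        return words;
    }

    // Assumption: the file is UTF-8 encoded. The reader is closed by the Scanner.
    static List<String> readWordsFromFile(File file) throws IOException {
        return readWordsFrom(Files.newBufferedReader(file.toPath(), StandardCharsets.UTF_8));
    }
}
```

Only the Scanner's internal buffer and the resulting list are held in memory, never the full file as one String.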

However, if this really matters, you should benchmark the alternatives, bearing in mind two things:

  • "Premature optimization" is usually wasted effort. (Or, to put it another way, the chances are that the performance of this part of your code really doesn't matter. The performance bottleneck is likely somewhere else.)
  • There are a lot of pitfalls in writing Java benchmarks that can lead to bogus performance measurements and incorrect conclusions.

<hr>

The other thing to note is that what you are trying to do (create a list of all "words" in order) does not scale. For a large enough input file, the application will run out of heap space. If you anticipate running this on input files larger than 100 MB or so, it may start to become a concern.

The solution may be to convert your processing into something that is more "stream" based ... so that you don't need to have a list of all words in memory.

This is essentially the same problem as the problem with Approach #1.
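To illustrate what "stream" based processing could look like, here is a minimal sketch that hands each word to a callback as soon as it is read, instead of collecting the words in a list. The names `StreamingWords`, `forEachWord` and `countWords` are hypothetical, and counting is only a stand-in for whatever per-word processing the application actually needs.

```java
import java.util.Scanner;
import java.util.function.Consumer;

// Hypothetical helper: process words one at a time without storing them all.
class StreamingWords {

    // Calls action once per non-blank comma-separated word, in input order.
    static void forEachWord(Readable source, Consumer<String> action) {
        try (Scanner scanner = new Scanner(source).useDelimiter("\\s*,\\s*")) {
            while (scanner.hasNext()) {
                String word = scanner.next().trim();
                if (!word.isEmpty()) {
                    action.accept(word);
                }
            }
        }
    }

    // Example consumer: counts words while keeping only a counter in memory.
    static long countWords(Readable source) {
        long[] count = {0}; // single-element array so the lambda can update it
        forEachWord(source, word -> count[0]++);
        return count[0];
    }
}
```

With this shape, memory use depends on the longest single word rather than on the size of the file.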

      <hr>

      <sup>1 - unless the file is small and fits into the buffer ... and then the whole question is largely moot.</sup>

      Answer2:

If you read the entire file into memory when you don't actually need to, you are:

  • wasting time: nothing is processed until you've read the entire file
  • wasting space
  • using a technique that won't scale to large files.

        Doing this has nothing to recommend it.

        Answer3:

        Approach 1:

A String has a maximum size: its length cannot exceed Integer.MAX_VALUE characters, and in practice it is also capped by the largest array the JVM can allocate at runtime.

Hence, prefer Approach 2 if the file is very large.
