Handling “”, “-” CSV with Univocity

Question:

Any idea how I can get properly separated rows? Some rows are getting glued together, and I can't figure out why it happens or how to stop it. This is the output I get:

col. 0: Date
col. 1: Col2
col. 2: Col3
col. 3: Col4
col. 4: Col5
col. 5: Col6
col. 6: Col7
col. 7: Col7
col. 8: Col8

col. 0: 2017-05-23
col. 1: String
col. 2: lo rem ipsum
col. 3: dolor sit amet
col. 4: mcdonalds.com/online.html
col. 5: null
col. 6: "","-""-""2017-05-23"
col. 7: String
col. 8: lo rem ipsum
col. 9: dolor sit amet
col. 10: burgerking.com
col. 11: https://burgerking.com/
col. 12: 20
col. 13: 2
col. 14: fake

col. 0: 2017-05-23
col. 1: String
col. 2: lo rem ipsum
col. 3: dolor sit amet
col. 4: wendys.com
col. 5: null
col. 6: "","-""-""2017-05-23"
col. 7: String
col. 8: lo rem ipsum
col. 9: dolor sit amet
col. 10: buggagump.com
col. 11: null
col. 12: "","-""-""2017-05-23"
col. 13: String
col. 14: cheese
col. 15: ad eum
col. 16: mcdonalds.com/online.html
col. 17: null
col. 18: "","-""-""2017-05-23"
col. 19: String
col. 20: burger
col. 21: ludus dissentiet
col. 22: www.mcdonalds.com
col. 23: https://www.mcdonalds.com/
col. 24: 25
col. 25: 3
col. 26: fake

col. 0: 2017-05-23
col. 1: String
col. 2: wine
col. 3: id erat utamur
col. 4: bubbagump.com
col. 5: https://buggagump.com/
col. 6: 25
col. 7: 3
col. 8: fake
done

A sample CSV follows (the \r\n line endings may have been corrupted when copy/pasting). The file is also available here: https://www.dropbox.com/s/86klza4qok4ty2s/malformed%20csv%20r%20n%20small.csv?dl=0

"Date","Col2","Col3","Col4","Col5","Col6","Col7","Col7","Col8"
"2017-05-23","String","lo rem ipsum","dolor sit amet","mcdonalds.com/online.html","","-","-","-"
"2017-05-23","String","lo rem ipsum","dolor sit amet","burgerking.com","https://burgerking.com/","20","2","fake"
"2017-05-23","String","lo rem ipsum","dolor sit amet","wendys.com","","-","-","-"
"2017-05-23","String","lo rem ipsum","dolor sit amet","buggagump.com","","-","-","-"
"2017-05-23","String","cheese","ad eum","mcdonalds.com/online.html","","-","-","-"
"2017-05-23","String","burger","ludus dissentiet","www.mcdonalds.com","https://www.mcdonalds.com/","25","3","fake"
"2017-05-23","String","wine","id erat utamur","bubbagump.com","https://buggagump.com/","25","3","fake"

Building settings:

CsvParserSettings settings = new CsvParserSettings();
settings.setDelimiterDetectionEnabled(true);
settings.setQuoteDetectionEnabled(true);
settings.setLineSeparatorDetectionEnabled(false); // all the same using `true`
settings.getFormat().setLineSeparator("\r\n");

CsvParser parser = new CsvParser(settings);
List<String[]> rows;
rows = parser.parseAll(getReader("testFiles/" + "malformed csv small.csv"));
for (String[] row : rows) {
    System.out.println("");
    int i = 0;
    for (String element : row) {
        System.out.println("col. " + i++ + ": " + element);
    }
}
System.out.println("done");

Answer1:

As you are testing the auto-detection process, I suggest you print out the detected format with:

CsvFormat format = parser.getDetectedFormat();
System.out.println(format);

This will print out:

CsvFormat:
    Comment character=#
    Field delimiter=,
    Line separator (normalized)=\n
    Line separator sequence=\r\n
    Quote character="
    Quote escape character=-
    Quote escape escape character=null

As you can see, the parser is not detecting the quote escape correctly. While the format detection process is typically very good, it is not guaranteed to always get it right, especially with small test samples. In your sample I can't see why it would pick up - as the escape character, so I opened this issue (https://github.com/uniVocity/univocity-parsers/issues/161) to investigate what is making it detect that one.
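To make the effect of a wrong quote escape concrete, here is a minimal sketch of a toy quoted-field scanner. It is not univocity's implementation and will not reproduce its exact output; the class `QuoteEscapeDemo`, the `parse` method, and the two-row sample are hypothetical names and data made up for illustration. It only shows the mechanism: when - is treated as the quote escape, the closing quote of a "-" value is read as an escaped quote, so the field never closes where it should and the following line break is swallowed into the value.

```java
import java.util.ArrayList;
import java.util.List;

public class QuoteEscapeDemo {

    // Toy CSV scanner (NOT univocity's algorithm): comma delimiter, '"' as quote,
    // and a configurable quote escape. Returns one list of fields per record.
    static List<List<String>> parse(String text, char quoteEscape) {
        List<List<String>> records = new ArrayList<>();
        List<String> record = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean quoteFollows = i + 1 < text.length() && text.charAt(i + 1) == '"';
            if (inQuotes) {
                if (c == quoteEscape && quoteFollows) {
                    field.append('"'); // escaped quote: keep it and stay inside the field
                    i++;
                } else if (c == '"') {
                    inQuotes = false;  // closing quote
                } else {
                    field.append(c);   // includes \r and \n found inside quotes
                }
            } else if (c == '"') {
                inQuotes = true;       // opening quote
            } else if (c == ',') {
                record.add(field.toString());
                field.setLength(0);
            } else if (c == '\n') {    // an unquoted line feed ends the record
                record.add(field.toString());
                field.setLength(0);
                records.add(record);
                record = new ArrayList<>();
            } else if (c != '\r') {
                field.append(c);
            }
        }
        if (field.length() > 0 || !record.isEmpty()) { // flush an unterminated last record
            record.add(field.toString());
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        // Two rows, the first ending in "-" values like the sample in the question.
        String csv = "\"a\",\"\",\"-\",\"-\"\r\n\"b\",\"x\",\"1\",\"2\"\r\n";
        System.out.println(parse(csv, '"').size()); // 2 records
        System.out.println(parse(csv, '-').size()); // 1 record: the rows were glued
    }
}
```

With " as the escape the two rows stay separate; with - they collapse into a single record, which is the same kind of gluing shown in the question's output.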

What you can do right now as a workaround, if you know for a fact that none of your input files will ever use - as the quote escape, is to detect the format, check what it picked up from the input, and then parse the contents, like this:

public List<String[]> parse(File input, CsvFormat format) {
    CsvParserSettings settings = new CsvParserSettings();
    if (format == null) { // no format specified? Let's detect what we are dealing with
        settings.detectFormatAutomatically();
        CsvParser parser = new CsvParser(settings);
        parser.beginParsing(input); // just call beginParsing to kick off the auto-detection process
        format = parser.getDetectedFormat(); // capture the format
        parser.stopParsing(); // stop the parser - no need to read anything yet
        System.out.println(format);
        if (format.getQuoteEscape() == '-') { // got something weird detected? Let's amend it
            format.setQuoteEscape('"');
        }
        return parse(input, format); // now parse with the intended format
    } else {
        settings.setFormat(format); // this parses with the format adjusted earlier
        CsvParser parser = new CsvParser(settings);
        return parser.parseAll(input);
    }
}

Now just call the parse method:

List<String[]> rows = parse(new File("/Users/jbax/Downloads/malformed csv r n small.csv"), null);

And you will have your data properly extracted. Hope this helps!
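If you want to confirm that no rows were glued, one simple option is to compare every row's column count with the header's. The following is only a sketch using plain Java; `RowWidthCheck` and `raggedRows` are hypothetical names, not part of univocity, and the check assumes the first row is the header and that every record should have the same number of columns.

```java
import java.util.ArrayList;
import java.util.List;

public class RowWidthCheck {

    // Returns the indexes of rows whose column count differs from the first (header) row.
    static List<Integer> raggedRows(List<String[]> rows) {
        List<Integer> ragged = new ArrayList<>();
        if (rows.isEmpty()) {
            return ragged;
        }
        int expected = rows.get(0).length;
        for (int i = 1; i < rows.size(); i++) {
            if (rows.get(i).length != expected) {
                ragged.add(i);
            }
        }
        return ragged;
    }

    public static void main(String[] args) {
        // Column counts taken from the output in the question: 9, 15, 27 and 9.
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[9]);
        rows.add(new String[15]);
        rows.add(new String[27]);
        rows.add(new String[9]);
        System.out.println(raggedRows(rows)); // [1, 2]
    }
}
```

An empty result for the list returned by the parse method above would indicate that every row has the same width as the header.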
