
Splitting text into paragraphs with regex in Java

I have a text file that contains some data. All paragraphs start with four spaces. My aim is to split this text into paragraphs.

First, I read the whole text using:

public String parseToString(String filePath) throws IOException {
    return new String(Files.readAllBytes(Paths.get(filePath)), StandardCharsets.UTF_8);
}

Then I use this code to split the string:

private static final String PARAGRAPH_SPLIT_REGEX = "(^\\s{4})";

public void parseText(String text) {
    String[] paragraphs = text.split(PARAGRAPH_SPLIT_REGEX);
    for (int i = 0; i < paragraphs.length; i++) {
        System.out.println("Paragraph: " + paragraphs[i]);
    }
}

My input file is:

    Hello, World!
    Hello, World!

And the output is:

Paragraph:
Paragraph: Hello, World!!!
    Hello, World!!!

What am I doing wrong?

Answer1:

By default, ^ matches only at the start of the whole string, not at the start of each line. If you want it to match at the start of each line, you need to add the multiline flag (?m) to your regex.
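The effect of the flag can be checked with a small sketch. The class name `MultilineAnchorDemo`, the helper `countMatches` and the sample text are made up for this illustration; only `Pattern` and `Matcher` from `java.util.regex` are used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultilineAnchorDemo {

    // Hypothetical helper: counts how many times the regex matches in the input.
    public static int countMatches(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Three lines, each starting with four spaces (sample data, not from the question).
        String text = "    first\n    second\n    third";

        // Without (?m), ^ anchors only at the very start of the input.
        System.out.println(countMatches("^\\s{4}", text));     // prints 1

        // With (?m), ^ also anchors right after each line terminator.
        System.out.println(countMatches("(?m)^\\s{4}", text)); // prints 3
    }
}
```

This is why the original regex finds only one place to split: without the flag there is a single position where ^ can match.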

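Also consider using a look-ahead. A look-ahead match is zero-width, and in Java 8 and later split never produces an empty leading element for a zero-width match at the start of the string, so the empty first result in your array disappears automatically.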

So try with this regex:

private static final String PARAGRAPH_SPLIT_REGEX = "(?m)(?=^\\s{4})";
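To compare a consuming pattern with the look-ahead version, here is a minimal sketch, assuming Java 8 or later. The class name `LookaheadSplitDemo`, the constant names and the sample text are illustrative only.

```java
public class LookaheadSplitDemo {

    // Consumes the four leading whitespace characters as the delimiter.
    public static final String CONSUMING = "(?m)^\\s{4}";

    // Zero-width: splits in front of the four characters and keeps them.
    public static final String LOOKAHEAD = "(?m)(?=^\\s{4})";

    public static void main(String[] args) {
        // Two paragraphs, each starting with four spaces (sample data).
        String text = "    one\n    two";

        String[] consumed = text.split(CONSUMING);
        String[] lookedAhead = text.split(LOOKAHEAD);

        // A positive-width match at index 0 leaves an empty first element.
        System.out.println(consumed.length);        // prints 3
        System.out.println(consumed[0].isEmpty());  // prints true

        // A zero-width match at index 0 is skipped by split on Java 8+.
        System.out.println(lookedAhead.length);     // prints 2
        System.out.println(lookedAhead[0].trim());  // prints one
    }
}
```

Note that the look-ahead keeps the four spaces (and the line break) inside each paragraph, which is why trimming the pieces afterwards is still useful.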

To get rid of unwanted whitespace such as spaces or newlines at the start or end of each paragraph, you can simply use the trim method:

public static void parseText(String text) {
    String[] paragraphs = text.split(PARAGRAPH_SPLIT_REGEX);
    for (String paragraph : paragraphs) {
        System.out.println("Paragraph: " + paragraph.trim());
    }
}

Example:

String s = "    Hello, World!\r\n" +
        "    Hello, World!\r\n" +
        "    Hello, World!";
parseText(s);

Output:

Paragraph: Hello, World!
Paragraph: Hello, World!
Paragraph: Hello, World!

Pre-Java 8 version:

If you need to use this code on older versions of Java, you will need to prevent a split at the start of the string (to avoid getting an empty first element). To do this, you can put (?!^) before the multiline flag. Because this ^ appears before (?m), it still matches only the start of the string, not the start of each line. Or, to be more explicit, you can use \A, which matches the start of the string regardless of the multiline flag.

So the pre-Java 8 version of the regex can look like

private static final String PARAGRAPH_SPLIT_REGEX = "(?!^)(?m)(?=^\\s{4})";

or

private static final String PARAGRAPH_SPLIT_REGEX = "(?m)(?!\\A)(?=^\\s{4})";
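As a sanity check, the two variants can be compared on a small input. This sketch only shows that both give the same pieces as each other; it was reasoned through for Java 8 or later and has not been verified on an older runtime. The class name `PreJava8SplitDemo`, the constant names and the sample text are invented for the example.

```java
import java.util.Arrays;

public class PreJava8SplitDemo {

    // (?!^) sits before (?m), so this ^ still means "start of the whole string".
    public static final String NOT_AT_START = "(?!^)(?m)(?=^\\s{4})";

    // \A always means "start of input", regardless of the multiline flag.
    public static final String NOT_AT_INPUT_START = "(?m)(?!\\A)(?=^\\s{4})";

    public static void main(String[] args) {
        // Two paragraphs, each starting with four spaces (sample data).
        String text = "    one\n    two";

        String[] first = text.split(NOT_AT_START);
        String[] second = text.split(NOT_AT_INPUT_START);

        // Neither pattern can match at index 0, so no split happens there.
        System.out.println(first.length);                  // prints 2
        System.out.println(Arrays.equals(first, second));  // prints true
    }
}
```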

Answer2:

Your regex should be \\s{4}, without the ^ at the beginning.
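A quick way to see what this pattern yields is to run it on a small input. The sketch below is only an illustration; the class name `PlainWhitespaceSplitDemo` and the sample text are assumptions, not part of the answer.

```java
public class PlainWhitespaceSplitDemo {

    // The pattern suggested above: any run of exactly four whitespace characters.
    public static final String REGEX = "\\s{4}";

    public static void main(String[] args) {
        // Two paragraphs, each starting with four spaces (sample data).
        String text = "    one\n    two";

        String[] parts = text.split(REGEX);

        // The first match consumes the four leading spaces, leaving an empty element.
        // The second match starts at the line break and consumes "\n" plus three
        // spaces, so one space stays in front of the second paragraph.
        for (String part : parts) {
            System.out.println("[" + part + "]");
        }
        // prints:
        // []
        // [one]
        // [ two]
    }
}
```

With this pattern, \s also matches line breaks, so calling trim on each piece and skipping empty ones would still be needed.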
