Splitting text into paragraphs with regex in Java

I have a text file that contains some data. All paragraphs start with four spaces. My aim is to split this text into paragraphs.

First, I read the whole text using:

public String parseToString(String filePath) throws IOException {
    return new String(Files.readAllBytes(Paths.get(filePath)), StandardCharsets.UTF_8);
}

Then I use this code to split the string:

private static final String PARAGRAPH_SPLIT_REGEX = "(^\\s{4})";

public void parseText(String text) {
    String[] paragraphs = text.split(PARAGRAPH_SPLIT_REGEX);
    for (int i = 0; i < paragraphs.length; i++) {
        System.out.println("Paragraph: " + paragraphs[i]);
    }
}

My input file is:

    Hello, World!
    Hello, World!

And the output is:

Paragraph:
Paragraph: Hello, World!!!
    Hello, World!!!

What am I doing wrong?

Answer 1:

By default, ^ matches the start of the string, not the start of a line. If you want it to match at the start of each line, add the multiline flag (?m) to your regex.
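To see the difference, here is a minimal sketch that counts matches with and without the flag. The class name CaretDemo, the helper countMatches and the sample text are made up for illustration; only Pattern and Matcher come from the JDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaretDemo {
    // Counts how many times the regex matches anywhere in the text.
    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "    first\n    second\n    third";
        // Without (?m), ^ anchors only at the start of the whole input.
        System.out.println(countMatches("^\\s{4}", text));     // prints 1
        // With (?m), ^ also anchors right after each line terminator.
        System.out.println(countMatches("(?m)^\\s{4}", text)); // prints 3
    }
}
```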

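Also consider using a look-ahead. Since Java 8, a zero-width match at the start of the string no longer produces an empty leading element in the array returned by split, so the first empty result disappears automatically.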
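A small comparison may help here. This is only a sketch, assuming Java 8 or newer; the class and method names are hypothetical. The first regex consumes the four spaces, so the match at index 0 has positive width and split keeps an empty leading element. The look-ahead version matches with zero width, so that element is dropped and the spaces stay in each paragraph.

```java
class LookaheadSplitDemo {
    // Consumes the indentation: positive-width match at the start of the input.
    static String[] splitConsuming(String text) {
        return text.split("(?m)^\\s{4}");
    }

    // Only looks ahead at the indentation: zero-width match at the start of the input.
    static String[] splitLookahead(String text) {
        return text.split("(?m)(?=^\\s{4})");
    }

    public static void main(String[] args) {
        String text = "    one\n    two";
        // Elements: "", "one\n", "two"
        System.out.println(splitConsuming(text).length); // prints 3
        // Elements: "    one\n", "    two" (Java 8+)
        System.out.println(splitLookahead(text).length); // prints 2
    }
}
```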

So try with this regex:

private static final String PARAGRAPH_SPLIT_REGEX = "(?m)(?=^\\s{4})";

To get rid of unwanted whitespace such as spaces or line breaks at the start or end of each paragraph, you can simply call trim:

public static void parseText(String text) {
    String[] paragraphs = text.split(PARAGRAPH_SPLIT_REGEX);
    for (String paragraph : paragraphs) {
        System.out.println("Paragraph: " + paragraph.trim());
    }
}

Example:

String s = "    Hello, World!\r\n" +
           "    Hello, World!\r\n" +
           "    Hello, World!";
parseText(s);

Output:

Paragraph: Hello, World!
Paragraph: Hello, World!
Paragraph: Hello, World!

Pre Java 8 version:

If you need to run this code on Java versions older than 8, you have to prevent a split at the start of the string yourself (otherwise the first element will be empty). To do this, put (?!^) before the multiline flag. Because that ^ comes before (?m), it still matches only the start of the string, not the start of a line. To be more explicit, you can use \A, which matches the start of the string regardless of the multiline flag.

So the pre-Java 8 version of the regex can look like

private static final String PARAGRAPH_SPLIT_REGEX = "(?!^)(?m)(?=^\\s{4})";

or

private static final String PARAGRAPH_SPLIT_REGEX = "(?m)(?!\\A)(?=^\\s{4})";
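As a sanity check, the sketch below runs the three variants from this answer on the same sample input. The class name, constant names and the paragraphs helper are hypothetical. On Java 8 or newer all three should give the same paragraphs; the two guarded variants only make a difference on older versions, which I have not run here.

```java
import java.util.ArrayList;
import java.util.List;

class PreJava8SplitDemo {
    static final String LOOKAHEAD_ONLY = "(?m)(?=^\\s{4})";
    static final String GUARD_WITH_CARET = "(?!^)(?m)(?=^\\s{4})";
    static final String GUARD_WITH_A = "(?m)(?!\\A)(?=^\\s{4})";

    // Splits the text with the given regex and trims every piece.
    static List<String> paragraphs(String text, String regex) {
        List<String> result = new ArrayList<>();
        for (String part : text.split(regex)) {
            result.add(part.trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "    Hello, World!\r\n    Hello, World!\r\n    Hello, World!";
        // On Java 8+ each line prints [Hello, World!, Hello, World!, Hello, World!]
        System.out.println(paragraphs(text, LOOKAHEAD_ONLY));
        System.out.println(paragraphs(text, GUARD_WITH_CARET));
        System.out.println(paragraphs(text, GUARD_WITH_A));
    }
}
```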

Answer 2:

Your regex should be \\s{4} without the ^ at the beginning.
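A quick way to check what this regex does is to run it on a small sample. The sketch below assumes an input with two paragraphs separated by \n; the class and method names are hypothetical. Note that \s also matches line breaks, so on this sample the second match is the newline plus three of the four spaces, and the match at index 0 has positive width, so an empty first element is kept.

```java
class PlainWhitespaceSplitDemo {
    // Splits on any run of exactly four whitespace characters, anywhere in the text.
    static String[] splitOnFourWhitespace(String text) {
        return text.split("\\s{4}");
    }

    public static void main(String[] args) {
        String text = "    Hello, World!\n    Hello, World!";
        String[] parts = splitOnFourWhitespace(text);
        System.out.println(parts.length);           // prints 3
        System.out.println(parts[0].isEmpty());     // prints true
        // One space is left over because "\n" plus three spaces was consumed.
        System.out.println("[" + parts[2] + "]");   // prints [ Hello, World!]
    }
}
```

Calling trim on each element hides the leftover space, but the empty first element would still need to be skipped.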
