Recursive Character Chunker
Overview
The Recursive Character Chunker splits text using an ordered hierarchy of delimiters. It tries the most meaningful delimiter first (for example, paragraph breaks) and falls back to less meaningful ones only when a piece is still larger than the configured chunk size.
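The fallback behaviour described above can be sketched in plain Java. This is a minimal illustration, not the library's implementation: the class `RecursiveSplitSketch` and its methods are hypothetical names, delimiters are treated as literal strings rather than regexes, and overlap and delimiter retention are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch; not part of the jchunk API.
class RecursiveSplitSketch {

    // Splits on the first delimiter, greedily merges small pieces back
    // together, and recurses with the remaining delimiters on any piece
    // that is still longer than chunkSize.
    static List<String> split(String text, List<String> delimiters, int chunkSize) {
        List<String> chunks = new ArrayList<>();
        if (text.length() <= chunkSize || delimiters.isEmpty()) {
            // Either it fits, or nothing is left to split on
            // (in which case the chunk may exceed chunkSize).
            String trimmed = text.trim();
            if (!trimmed.isEmpty()) chunks.add(trimmed);
            return chunks;
        }

        String delimiter = delimiters.get(0);
        List<String> rest = delimiters.subList(1, delimiters.size());

        // Assumption: delimiters are literals; "" means character-level.
        String[] pieces = delimiter.isEmpty()
                ? text.split("")
                : text.split(Pattern.quote(delimiter));

        StringBuilder current = new StringBuilder();
        for (String piece : pieces) {
            if (piece.isEmpty()) continue;
            if (piece.length() > chunkSize) {
                // Too large for this level: fall back to the next delimiter.
                flush(current, chunks);
                chunks.addAll(split(piece, rest, chunkSize));
            } else if (current.length() == 0) {
                current.append(piece);
            } else if (current.length() + delimiter.length() + piece.length() <= chunkSize) {
                current.append(delimiter).append(piece);
            } else {
                flush(current, chunks);
                current.append(piece);
            }
        }
        flush(current, chunks);
        return chunks;
    }

    // Emits the accumulated chunk (trimmed) and resets the buffer.
    static void flush(StringBuilder current, List<String> chunks) {
        String chunk = current.toString().trim();
        if (!chunk.isEmpty()) chunks.add(chunk);
        current.setLength(0);
    }

    public static void main(String[] args) {
        List<String> delimiters = List.of("\n\n", "\n", " ", "");
        String text = "split this text\n\nI need it\n to be done as soon as possible";
        System.out.println(split(text, delimiters, 15));
    }
}
```

The greedy merge step is what keeps chunks close to the size limit: without it, falling back to the space delimiter would emit one chunk per word.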
Installation
<dependency>
    <groupId>io.jchunk</groupId>
    <artifactId>jchunk-recursive-character</artifactId>
    <version>${jchunk.version}</version>
</dependency>
implementation group: 'io.jchunk', name: 'jchunk-recursive-character', version: "${JCHUNK_VERSION}"
Configuration
// using default config
RecursiveCharacterChunker chunker = new RecursiveCharacterChunker();
// with custom config
Config config = Config.builder()
    .chunkSize(10)
    .chunkOverlap(0)
    .delimiters(List.of(";"))
    .trimWhitespace(true)
    .keepDelimiter(Delimiter.START)
    .build();
RecursiveCharacterChunker chunker = new RecursiveCharacterChunker(config);
Configuration Options
- chunkSize: Maximum number of characters per chunk. Chunks may exceed this limit if the text cannot be split further. Default: 100.
- chunkOverlap: Number of characters to overlap between consecutive chunks. Helps preserve context across boundaries. Default: 20.
- delimiters: Ordered list of regex strings used for splitting. The chunker tries them in sequence; if none match, it falls back to the last one ("" = character-level). Default: ["\n\n", "\n", " ", ""].
- keepDelimiter: Whether to keep delimiters in chunks: NONE, START, or END. Default: START.
- trimWhitespace: Whether to trim leading/trailing whitespace in each chunk. Default: true.
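To make the three keepDelimiter modes concrete, here is a small standalone sketch built on regex lookarounds. It is an assumption about what the modes mean (START attaches the delimiter to the beginning of the following piece, END to the end of the preceding piece, NONE drops it), not a description of the library's internals. The `KeepDelimiterSketch` class and its `Keep` enum are hypothetical and separate from the library's own `Delimiter` type.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch; not part of the jchunk API.
class KeepDelimiterSketch {

    enum Keep { NONE, START, END }

    // Splits text on a literal delimiter, keeping or dropping the
    // delimiter depending on the requested mode.
    static List<String> splitKeeping(String text, String delimiter, Keep keep) {
        String quoted = Pattern.quote(delimiter);
        String regex;
        switch (keep) {
            case START:
                // Zero-width lookahead: cut just before each delimiter.
                regex = "(?=" + quoted + ")";
                break;
            case END:
                // Zero-width lookbehind: cut just after each delimiter.
                regex = "(?<=" + quoted + ")";
                break;
            default:
                // Consume the delimiter so it is dropped from the output.
                regex = quoted;
        }
        return Arrays.asList(text.split(regex));
    }

    public static void main(String[] args) {
        for (Keep keep : Keep.values()) {
            System.out.println(keep + " -> " + splitKeeping("a;b;c", ";", keep));
        }
    }
}
```

For the input "a;b;c" and delimiter ";", the sketch produces [a, b, c] for NONE, [a, ;b, ;c] for START, and [a;, b;, c] for END.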
Default Delimiters
The default separator hierarchy is:
- \n\n (double newlines)
- \n (single newlines)
- \s (single space)
- "" (empty string)
Example
Config config = Config.builder().chunkSize(15).chunkOverlap(0).build();
RecursiveCharacterChunker chunker = new RecursiveCharacterChunker(config);
List<Chunk> chunks = chunker.split("split this text\n\nI need it\n to be done as soon as possible");
// Result: ["split this text", "I need it", "to be done as", "soon as", "possible"]
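The example above sets chunkOverlap(0). To show what a non-zero overlap means, the following sketch applies a plain character-level sliding window in which each chunk repeats the last `overlap` characters of the previous one. It is only an illustration of the concept: the chunker itself works on delimiter-separated pieces, and the class `OverlapSketch` and method `window` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not part of the jchunk API.
class OverlapSketch {

    // Cuts text into windows of chunkSize characters, where each window
    // starts (chunkSize - overlap) characters after the previous one.
    static List<String> window(String text, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("require 0 <= overlap < chunkSize");
        }
        List<String> chunks = new ArrayList<>();
        int step = chunkSize - overlap;
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + chunkSize, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) break; // last window reached the end
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(window("abcdefghij", 4, 2));
    }
}
```

With a 10-character input, chunkSize 4, and overlap 2, the sketch yields [abcd, cdef, efgh, ghij]: 16 characters are stored for 10 characters of text, which is the duplication cost that overlap trades for context.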
Custom Separators
List<String> customSeparators = List.of("-", ".");
Config config = Config.builder()
    .chunkSize(5)
    .chunkOverlap(0)
    .delimiters(customSeparators)
    .build();
RecursiveCharacterChunker chunker = new RecursiveCharacterChunker(config);
List<Chunk> chunks = chunker.split("");
// Result: ["give", "- me a", "_ hand"]
Pros and Cons
Pros
- Easy to implement and understand
- More advanced than fixed chunking
- Predictable chunk sizes
- Fast processing
Cons
- Doesn't consider the meaning of the text, only its delimiters
- May produce larger chunks than specified
- Overlap creates duplicate data