Decision Workspace
tokenizer vs unicode-segmentation vs text-splitter
Side-by-side comparison of Rust crates
tokenizer
Health score: 39 · growing · v0.1.2
Thai text tokenizer
unicode-segmentation
Health score: 69 · stable · v1.13.2
This crate provides Grapheme Cluster, Word and Sentence boundaries according to Unicode Standard Annex #29 rules.
text-splitter
Health score: 59 · growing · v0.29.3
Split text into semantic chunks, up to a desired chunk size. Supports calculating length by characters and tokens, and is callable from Rust and Python.
Core Metrics
| Metric | tokenizer | unicode-segmentation | text-splitter |
|---|---|---|---|
| Health Score | 39 | 69 | 59 |
| Total Downloads | 3.9K | 335.5M | 1.1M |
| 30d Downloads | 48 | 23.8M | 113.6K |
| Dependents | 0 | 12.0K | 654 |
| Releases | 2 | 26 | 60 |
| Last Updated | 2126d ago | 1d ago | 87d ago |
| Age | 5y 10m | 10y 11m | 2y 10m |
Health Breakdown
| Category | tokenizer | unicode-segmentation | text-splitter |
|---|---|---|---|
| Maintenance | 3 | 17 | 14 |
| Quality | 17 | 17 | 13 |
| Community | 6 | 16 | 13 |
| Popularity | 4 | 8 | 7 |
| Documentation | 9 | 11 | 12 |
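The five category scores for each crate add up to the health score reported under Core Metrics (39, 69, and 59). The following is a minimal sketch of that arithmetic, assuming the overall score is a plain unweighted sum; the page does not state its formula, and the `Breakdown` type and `breakdowns` function are hypothetical names used only for illustration.

```rust
/// Category scores for one crate, copied from the Health Breakdown section.
/// `Breakdown` is a hypothetical type, not part of any of the compared crates.
struct Breakdown {
    name: &'static str,
    maintenance: u32,
    quality: u32,
    community: u32,
    popularity: u32,
    documentation: u32,
}

impl Breakdown {
    /// Assumption: the health score is the unweighted sum of the five categories.
    fn total(&self) -> u32 {
        self.maintenance + self.quality + self.community + self.popularity + self.documentation
    }
}

/// The three breakdowns listed on the page.
fn breakdowns() -> [Breakdown; 3] {
    [
        Breakdown { name: "tokenizer", maintenance: 3, quality: 17, community: 6, popularity: 4, documentation: 9 },
        Breakdown { name: "unicode-segmentation", maintenance: 17, quality: 17, community: 16, popularity: 8, documentation: 11 },
        Breakdown { name: "text-splitter", maintenance: 14, quality: 13, community: 13, popularity: 7, documentation: 12 },
    ]
}

fn main() {
    // Print each crate's summed score for comparison with the Core Metrics table.
    for b in breakdowns().iter() {
        println!("{}: {}", b.name, b.total());
    }
}
```

Under this assumption, tokenizer's low total comes mostly from Maintenance (3) and Popularity (4); its Quality score matches that of unicode-segmentation.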
Technical Details
| Detail | tokenizer | unicode-segmentation | text-splitter |
|---|---|---|---|
| Version | 0.1.2 | 1.13.2 | 0.29.3 |
| Stable (≥1.0) | ✗ No | ✓ Yes | ✗ No |
| License | BSD-3-Clause | MIT OR Apache-2.0 | MIT |
| Dependencies | 2 | 3 | 21 |
| Crate Size | 17KB | 112KB | 59KB |
| Features | 3 | 1 | 4 |
| Yanked % | 0.0% | 7.7% | 1.7% |
| Edition | 2018 | 2018 | 2021 |
| MSRV | — | 1.85.0 | 1.83.0 |
| Owners | 1 | 6 | 1 |
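The Yanked % row is easier to judge as an absolute count. The sketch below estimates the number of yanked releases, assuming the percentage is yanked releases divided by the Releases figure from Core Metrics and that the figure includes yanked versions; neither assumption is confirmed by the page, and `estimated_yanked` is a hypothetical helper.

```rust
/// Estimate the number of yanked releases from a release count and a yanked percentage.
/// Assumption: yanked % = yanked releases / total releases * 100, rounded for display.
fn estimated_yanked(releases: u32, yanked_percent: f64) -> u32 {
    (releases as f64 * yanked_percent / 100.0).round() as u32
}

fn main() {
    // (crate name, Releases, Yanked %) as listed in the tables above.
    let rows = [
        ("tokenizer", 2, 0.0),
        ("unicode-segmentation", 26, 7.7),
        ("text-splitter", 60, 1.7),
    ];
    for (name, releases, pct) in rows.iter() {
        println!("{}: about {} yanked of {}", name, estimated_yanked(*releases, *pct), releases);
    }
}
```

On these assumptions the percentages correspond to roughly two yanked releases for unicode-segmentation and one for text-splitter, so the 7.7% figure reflects a small release count more than a pattern of bad releases.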
Quick Verdict
- unicode-segmentation leads with a health score of 69/100; none of the three options scores above 80.
- unicode-segmentation is depended on by 12.0K crates, the strongest sign of ecosystem trust among the three.
- ⚠ tokenizer was last updated 2126 days ago (nearly six years).
- tokenizer and text-splitter are pre-1.0, so their APIs may change between releases.