benches/bench WIP reintroduce partials 2026-05-02 06:52:10 -03:00
src WIP reintroduce partials 2026-05-02 06:52:10 -03:00
.gitignore Add benchmark for json_tokenizer 2026-04-26 12:07:18 -03:00
Cargo.lock Add benchmark for json_tokenizer 2026-04-26 12:07:18 -03:00
Cargo.toml Add benchmark for json_tokenizer 2026-04-26 12:07:18 -03:00
README.md Produce tokens in batches 2026-05-02 06:19:28 -03:00
rust-toolchain Add unescape logic 2026-04-17 14:30:28 -03:00

JSON to Arrow converter

Improving json_tokenizer

Ideas:

  • make pos tracking optional
  • remove mem::replace() of state
  • read multiple tokens at once
  • reduce size of state

Original:

json_tokenizer          time:   [9.8973 ms 9.9411 ms 9.9988 ms]

Remove mem::replace():

json_tokenizer          time:   [9.6076 ms 9.6246 ms 9.6424 ms]
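The tokenizer's actual state type is not shown here, so the following is only a minimal sketch of the idea, using a hypothetical `State` enum. `step_replace` moves the state out with `mem::replace` and writes a new value back on every byte; `step_in_place` matches on `&mut State` and only writes when something changes.

```rust
use std::mem;

// Hypothetical tokenizer state; the real one in src/ may look different.
#[derive(Debug, PartialEq)]
enum State {
    Start,
    InString { escaped: bool },
    InNumber,
}

// Variant A: move the state out, compute the next one, write it back.
fn step_replace(state: &mut State, byte: u8) {
    let old = mem::replace(state, State::Start);
    *state = match old {
        State::Start if byte == b'"' => State::InString { escaped: false },
        State::Start if byte.is_ascii_digit() => State::InNumber,
        State::InString { escaped: false } if byte == b'\\' => State::InString { escaped: true },
        State::InString { escaped: false } if byte == b'"' => State::Start,
        State::InString { .. } => State::InString { escaped: false },
        State::InNumber if byte.is_ascii_digit() => State::InNumber,
        _ => State::Start,
    };
}

// Variant B: borrow the state mutably and update it in place.
fn step_in_place(state: &mut State, byte: u8) {
    match state {
        State::Start => {
            if byte == b'"' {
                *state = State::InString { escaped: false };
            } else if byte.is_ascii_digit() {
                *state = State::InNumber;
            }
        }
        State::InString { escaped } => {
            if *escaped {
                *escaped = false;
            } else if byte == b'\\' {
                *escaped = true;
            } else if byte == b'"' {
                *state = State::Start;
            }
        }
        State::InNumber => {
            if !byte.is_ascii_digit() {
                *state = State::Start;
            }
        }
    }
}
```

Both variants produce the same state sequence. Whether variant B is faster depends on how large the state is and on what the optimizer already does with variant A.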

Reduce size of state:

json_tokenizer          time:   [9.4924 ms 9.5022 ms 9.5122 ms]
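One common way to shrink an enum-based state, shown here as a sketch with hypothetical types (`WideState`, `NarrowState`, and the 64-byte buffer are assumptions, not the crate's real layout), is to move large payloads behind a pointer so the enum is no longer as large as its largest variant. `std::mem::size_of` makes the effect visible.

```rust
use std::mem::size_of;

// Hypothetical "wide" state: the inline buffer makes the whole enum
// at least as large as this variant.
#[allow(dead_code)]
enum WideState {
    Start,
    InString { partial: [u8; 64], len: usize },
}

// Hypothetical "narrow" state: the buffer lives on the heap, so the
// enum only stores a pointer and a length.
#[allow(dead_code)]
enum NarrowState {
    Start,
    InString { partial: Box<[u8; 64]>, len: usize },
}

// Returns (size of WideState, size of NarrowState) in bytes.
fn sizes() -> (usize, usize) {
    (size_of::<WideState>(), size_of::<NarrowState>())
}
```

Boxing trades a smaller state for an extra allocation and pointer indirection, so it is only one option; narrowing field types or splitting rarely used fields out of the hot state are others.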

Remove pos from NextToken:

json_tokenizer          time:   [9.1810 ms 9.2101 ms 9.2467 ms]

Remove pos:

json_tokenizer          time:   [8.7537 ms 8.7737 ms 8.7955 ms]
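Rather than deleting position tracking outright, the "make pos tracking optional" idea could be sketched with a generic parameter, so callers who do not need positions pay nothing for them. The trait `PosTracker`, the types `NoPos` and `RowCol`, and the function `count_structural` are all hypothetical names; the loop is a stand-in for the real tokenizer and does not skip string contents.

```rust
// Hypothetical trait: called once for every consumed byte.
trait PosTracker: Default {
    fn advance(&mut self, byte: u8);
}

// Tracks nothing; the call is expected to compile away.
#[derive(Default)]
struct NoPos;

impl PosTracker for NoPos {
    #[inline(always)]
    fn advance(&mut self, _byte: u8) {}
}

// Tracks a zero-based row and column (column counted in bytes).
#[derive(Default, Debug, PartialEq)]
struct RowCol {
    row: usize,
    col: usize,
}

impl PosTracker for RowCol {
    fn advance(&mut self, byte: u8) {
        if byte == b'\n' {
            self.row += 1;
            self.col = 0;
        } else {
            self.col += 1;
        }
    }
}

// Stand-in for the tokenizer loop: counts structural bytes and
// returns the final position as seen by the chosen tracker.
fn count_structural<P: PosTracker>(input: &[u8]) -> (usize, P) {
    let mut pos = P::default();
    let mut count = 0;
    for &b in input {
        if matches!(b, b'{' | b'}' | b'[' | b']' | b':' | b',') {
            count += 1;
        }
        pos.advance(b);
    }
    (count, pos)
}
```

Calling `count_structural::<NoPos>(input)` selects the no-op tracker at compile time, while `count_structural::<RowCol>(input)` keeps full positions for error reporting.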

Use byte offsets instead of row/column pos:

json_tokenizer          time:   [9.0958 ms 9.1167 ms 9.1391 ms]
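If the tokenizer only records a byte offset, the row and column can still be recovered later, for example when formatting an error. The helper below is a sketch of that lazy conversion; `row_col` is a hypothetical name, and it assumes one-based rows and columns with columns counted in bytes.

```rust
// Convert a byte offset into a one-based (row, column) pair by
// scanning the input up to that offset. Columns are counted in bytes.
fn row_col(input: &[u8], offset: usize) -> (usize, usize) {
    let end = offset.min(input.len());
    let before = &input[..end];
    // Each newline before the offset starts a new row.
    let row = before.iter().filter(|&&b| b == b'\n').count() + 1;
    // The current line starts right after the last newline, if any.
    let line_start = before
        .iter()
        .rposition(|&b| b == b'\n')
        .map_or(0, |i| i + 1);
    (row, end - line_start + 1)
}
```

The scan is linear in the offset, which is acceptable when it runs only on the error path rather than once per token.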

Remove all partials:

json_tokenizer          time:   [8.3998 ms 8.4180 ms 8.4378 ms]

No UTF-8 check:

json_tokenizer          time:   [7.2458 ms 7.2634 ms 7.2824 ms]
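For reference, the standard library offers both a validating and a non-validating way to view bytes as `&str`. How the tokenizer itself skips the check is not shown here, so this is only a sketch of the trade-off; `str_checked` and `str_unchecked` are hypothetical wrapper names.

```rust
// Validating conversion: scans the bytes and returns None if they
// are not valid UTF-8.
fn str_checked(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}

// Non-validating conversion: no scan at all.
//
// SAFETY: the caller must guarantee that `bytes` is valid UTF-8,
// otherwise later string operations have undefined behavior.
unsafe fn str_unchecked(bytes: &[u8]) -> &str {
    unsafe { std::str::from_utf8_unchecked(bytes) }
}
```

Skipping validation is only sound when the input has already been validated elsewhere, or when the tokens are kept as `&[u8]` and never exposed as `&str`.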

Batches of 128:

json_tokenizer          time:   [7.1455 ms 7.1628 ms 7.1815 ms]
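Batching could look roughly like the sketch below: instead of returning one token per call, the tokenizer fills a caller-provided buffer of up to 128 tokens. The `Token` and `Tokenizer` types are hypothetical and only recognize structural bytes; everything else is skipped to keep the example short.

```rust
const BATCH: usize = 128;

// Hypothetical token type covering structural bytes only.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Token {
    ObjectStart,
    ObjectEnd,
    ArrayStart,
    ArrayEnd,
    Colon,
    Comma,
}

struct Tokenizer<'a> {
    input: &'a [u8],
    pos: usize,
}

impl<'a> Tokenizer<'a> {
    fn new(input: &'a [u8]) -> Self {
        Self { input, pos: 0 }
    }

    // Fill `out` with up to BATCH tokens and return how many were
    // written. A return value of 0 means the input is exhausted.
    fn next_batch(&mut self, out: &mut [Token; BATCH]) -> usize {
        let mut n = 0;
        while n < BATCH && self.pos < self.input.len() {
            let tok = match self.input[self.pos] {
                b'{' => Some(Token::ObjectStart),
                b'}' => Some(Token::ObjectEnd),
                b'[' => Some(Token::ArrayStart),
                b']' => Some(Token::ArrayEnd),
                b':' => Some(Token::Colon),
                b',' => Some(Token::Comma),
                _ => None, // sketch only: all other bytes are skipped
            };
            self.pos += 1;
            if let Some(t) = tok {
                out[n] = t;
                n += 1;
            }
        }
        n
    }
}
```

A caller would loop on `next_batch` until it returns 0 and process `&buf[..n]` each time, which amortizes the per-call overhead over many tokens.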