Custom Parsers

Weave handles CSV, JSON, JSONL, YAML, TOML, INI, XML, and HTML out of the box with sensible defaults. This page covers two things: configuring the builtin parsers when the defaults don’t fit, and writing your own parsers for custom formats.

Configuring Builtin Parsers

When you need different behavior from a builtin format — a pipe-separated CSV, headerless data, pretty-printed JSON — use builder functions. Each builder returns a configured parser or formatter function that works as a drop-in replacement for format symbols in read() and write().

# Default behavior — format symbol
data = read("data.csv", :csv)

# Configured behavior — builder function
my_parser = csv_parser(separator: "|", headers: false)
data = read("data.psv", my_parser)

CSV

CSV has the most configuration options. Use csv_parser for reading and csv_formatter for writing.

csv_parser options:

Option      Default   Description
separator   ","       Field delimiter
headers     true      First row contains column names
quote       "\""      Quote character for fields
escape      "\""      Escape character within quoted fields
comment     none      Comment line prefix character
trim        false     Trim whitespace from fields

# Read a pipe-separated file with no headers
parser = csv_parser(separator: "|", headers: false)
rows = read("data.psv", parser)
# rows is a container of containers (list of lists — no headers means no keys)

# Write pipe-separated output
formatter = csv_formatter(separator: "|")
write("output.psv", data, formatter)

# Read a CSV with comment lines and trimmed fields
parser = csv_parser(comment: "#", trim: true)
data = read("messy.csv", parser)

JSON

json_parser() has no configuration options.

json_formatter controls output style:

Option   Default   Description
pretty   false     Pretty-print with indentation
indent   2         Spaces per indent level (when pretty is true)

# Pretty-print JSON output
formatter = json_formatter(pretty: true, indent: 4)
write("config.json", data, formatter)

YAML

yaml_parser() has no configuration options.

yaml_formatter controls indentation:

Option   Default   Description
indent   2         Spaces per indent level

formatter = yaml_formatter(indent: 4)
write("config.yaml", data, formatter)

TOML

toml_parser() and toml_formatter() have no configuration options. They exist for API consistency (the format symbols map to these functions under the hood), so there is nothing to configure here.

INI

ini_parser handles files with different comment and delimiter conventions:

Option      Default       Description
comment     ";" and "#"   Comment prefix characters
delimiter   "="           Key-value delimiter

ini_formatter controls output format:

Option      Default   Description
delimiter   "="       Key-value delimiter

# Parse an INI file that uses : instead of =
parser = ini_parser(delimiter: ":")
config = read("app.conf", parser)

# Write it back with the same convention
formatter = ini_formatter(delimiter: ":")
write("app.conf", config, formatter)

XML

xml_parser controls how attributes and text content are represented in the parsed result:

Option          Default   Description
attr_prefix     "@"       Prefix for attribute keys in the result
text_key        "#text"   Key for text content nodes
collapse_text   true      Collapse single text children to string values
trim_text       true      Trim whitespace from text nodes
trim_text true Trim whitespace from text nodes

xml_formatter controls XML output:

Option        Default   Description
attr_prefix   "@"       Prefix identifying attribute keys
text_key      "#text"   Key for text content
pretty        true      Pretty-print with indentation
indent        2         Spaces per indent level
root_name     "root"    Root element name (when data has no natural root)
declaration   true      Include XML declaration header

# Parse XML
parser = xml_parser()
doc = read("books.xml", parser)

# Write XML with custom formatting
formatter = xml_formatter(indent: 4, declaration: false)
write("output.xml", data, formatter)

HTML

html_parser is a forgiving parser that tolerates malformed HTML, such as missing closing tags and unquoted attributes.

Option          Default   Description
attr_prefix     "@"       Prefix for attribute keys
text_key        "#text"   Key for text content
collapse_text   true      Collapse single text children
trim_text       true      Trim whitespace from text

parser = html_parser()
page = read("index.html", parser)
# Works even with messy, real-world HTML

Writing Your Own Parsers

For formats Weave doesn’t handle natively — log files, fixed-width data, proprietary formats — you can write your own parser function and pass it to read().

How It Works

The read() function accepts either a format symbol or a function:

# Built-in format
data = read("config.json", :json)

# Custom parser
data = read("data.log", my_parser)

When you pass a function, Weave calls it with the raw file contents and expects a Container back.
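
The contract described above can be modeled outside Weave. The following Python sketch illustrates the dispatch only and is not Weave's implementation: read_with, BUILTIN_PARSERS, and parse_lines are hypothetical names, and format symbols are approximated with plain strings.

```python
import json
from typing import Any, Callable, Union

# Hypothetical registry standing in for Weave's format symbols.
BUILTIN_PARSERS: dict[str, Callable[[str], Any]] = {
    "json": json.loads,
}

def read_with(path: str, fmt: Union[str, Callable[[str], Any]]) -> Any:
    """Read a file, then parse it with a named format or a custom function."""
    with open(path, encoding="utf-8") as f:
        raw_text = f.read()
    # A callable is treated as a custom parser; a string names a built-in one.
    parser = fmt if callable(fmt) else BUILTIN_PARSERS[fmt]
    return parser(raw_text)

def parse_lines(raw_text: str) -> list[str]:
    """Custom parser: one list item per non-empty line."""
    return [line for line in raw_text.split("\n") if line]
```

The point of the sketch is that a custom parser needs no registration step: any function that takes the raw text and returns a collection can stand in wherever a format name is accepted.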

Your First Custom Parser

Suppose each log entry is a block of key=value lines, and entries are separated by a line containing ---:

timestamp=2024-01-15T10:30:00
level=INFO
message=Server started
---
timestamp=2024-01-15T10:30:05
level=DEBUG
message=Connection accepted

fn parse_log_entries(raw_text) {
    entries = []
    # Each entry is a block of lines; blocks are separated by ---
    blocks = raw_text.split("---")

    blocks *> ^(block) {
        entry = []
        split(trim(block), "\n") *> ^(line) {
            if line.len > 0 {
                # Split "key=value" and trim both halves
                parts = split(line, "=") *> trim
                entry[parts[0]] = parts[1]
            }
        }
        # Skip blocks that produced no key-value pairs
        if entry.len > 0 { entries << entry }
    }

    entries
}

# Use it
logs = read("server.log", parse_log_entries)

logs *> ^(entry) {
    if entry[:level] == "ERROR" {
        puts("Error at " + entry[:timestamp] + ": " + entry[:message])
    }
}
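
To make the shape of the result concrete, here is the same block-and-line logic modeled in Python and run on the sample entries. It is an illustrative sketch, not Weave code, and it differs from parse_log_entries in one deliberate way: each line is split on the first = only. That is an assumption about how values containing = should be handled, which the parser above does not address.

```python
def parse_log_entries(raw_text: str) -> list[dict[str, str]]:
    """Split text into '---'-separated blocks of key=value lines."""
    entries = []
    for block in raw_text.split("---"):
        entry = {}
        for line in block.strip().split("\n"):
            if not line:
                continue
            # partition() splits on the first "=" only, so a value that
            # itself contains "=" is kept whole.
            key, sep, value = line.partition("=")
            if sep:  # ignore lines that contain no "=" at all
                entry[key.strip()] = value.strip()
        if entry:
            entries.append(entry)
    return entries

sample = """timestamp=2024-01-15T10:30:00
level=INFO
message=Server started
---
timestamp=2024-01-15T10:30:05
level=DEBUG
message=Connection accepted
"""

entries = parse_log_entries(sample)
print(len(entries))         # 2
print(entries[1]["level"])  # DEBUG
```

Each entry comes back as one keyed record, and the whole result is a list of those records, which is the nested structure the filtering loop above iterates over.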

Parsing Fixed-Width Data

Many legacy systems export fixed-width files, in which each field occupies a fixed range of columns on every line:

John Smith       42  Engineer      75000
Alice Johnson    35  Manager       92000
Bob Williams     28  Developer     68000

fn parse_fixed_width(raw_text) {
    # Column layout: zero-based start offset, width, and field name
    fields = [
        [start: 0,  length: 17, name: :name],
        [start: 17, length: 4,  name: :age],
        [start: 21, length: 14, name: :title],
        [start: 35, length: 6,  name: :salary]
    ]

    records = []
    split(raw_text, "\n") *> ^(line) {
        if line.len > 0 {
            record = []
            fields *> ^(f) {
                value = line.substr(f[:start], f[:length]).trim()
                if f[:name] == :age || f[:name] == :salary {
                    value = value.to_num()
                }
                record[f[:name]] = value
            }
            records << record
        }
    }
    records
}

employees = read("employees.dat", parse_fixed_width)

fn sum(v, acc: 0) { acc + v }
high_earners = employees *> ^(e) { if e[:salary] > 80000 { e } }
total = high_earners *> ^(e) { e[:salary] } &> sum
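
The column offsets are the part most likely to go wrong, so it can help to check them against real rows. The Python sketch below is a cross-check rather than Weave code: it applies the same start and length values to the three sample lines. One assumption to note: Python slicing silently stops at the end of the line, so the 6-wide salary column accepts the 5-character values here; whether Weave's substr behaves the same way is not stated on this page.

```python
# Column layout copied from parse_fixed_width: (name, start, length).
FIELDS = [
    ("name", 0, 17),
    ("age", 17, 4),
    ("title", 21, 14),
    ("salary", 35, 6),
]
NUMERIC = {"age", "salary"}

def parse_fixed_width(raw_text: str) -> list[dict]:
    records = []
    for line in raw_text.split("\n"):
        if not line:
            continue
        record = {}
        for name, start, length in FIELDS:
            # Slice the column, then drop the padding spaces.
            value = line[start:start + length].strip()
            record[name] = int(value) if name in NUMERIC else value
        records.append(record)
    return records

sample = (
    "John Smith       42  Engineer      75000\n"
    "Alice Johnson    35  Manager       92000\n"
    "Bob Williams     28  Developer     68000\n"
)

employees = parse_fixed_width(sample)
high_earners = [e for e in employees if e["salary"] > 80000]
total = sum(e["salary"] for e in high_earners)
print(employees[0])
# {'name': 'John Smith', 'age': 42, 'title': 'Engineer', 'salary': 75000}
print(total)  # 92000
```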

Parser Factories

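A parser factory is a function that returns a closure. Weave uses the same technique to build configurable functions for pipelines. The returned closure captures the configuration (here, the delimiter) and serves as the parser: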

fn make_delimited_parser(delimiter) {
    ^(raw_text) {
        lines = split(raw_text, "\n")
        headers = split(lines[0], delimiter) *> ^(h) { h.trim() }

        records = []
        i = 1
        while i < lines.len {
            if lines[i].len > 0 {
                values = split(lines[i], delimiter)
                record = []
                j = 0
                while j < headers.len {
                    record[headers[j]] = values[j].trim()
                    j += 1
                }
                records << record
            }
            i += 1
        }
        records
    }
}

# Create parsers for different delimiters
parse_pipe = make_delimited_parser("|")
parse_tab = make_delimited_parser("\t")
parse_semicolon = make_delimited_parser(";")

pipe_data = read("data.psv", parse_pipe)
tab_data = read("data.tsv", parse_tab)
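
For readers more familiar with closures in another language, the factory pattern looks like this in Python. It is a sketch of the technique, not a translation that Weave guarantees to match. It adds one behavior that make_delimited_parser above does not have: a row with fewer values than headers is padded with empty strings. That padding is an assumption made for illustration.

```python
from typing import Callable

def make_delimited_parser(delimiter: str) -> Callable[[str], list[dict[str, str]]]:
    """Return a parser closure that remembers `delimiter`."""
    def parse(raw_text: str) -> list[dict[str, str]]:
        lines = [ln for ln in raw_text.split("\n") if ln]
        if not lines:
            return []
        headers = [h.strip() for h in lines[0].split(delimiter)]
        records = []
        for line in lines[1:]:
            values = [v.strip() for v in line.split(delimiter)]
            # Assumption: pad short rows instead of failing on a missing index.
            values += [""] * (len(headers) - len(values))
            records.append(dict(zip(headers, values)))
        return records
    return parse

parse_pipe = make_delimited_parser("|")
parse_tab = make_delimited_parser("\t")

print(parse_pipe("id|name\n1|Ada\n2|Grace"))
# [{'id': '1', 'name': 'Ada'}, {'id': '2', 'name': 'Grace'}]
print(parse_tab("a\tb\n1"))
# [{'a': '1', 'b': ''}]
```

Each call to the factory produces an independent parser, so parse_pipe and parse_tab share code but not configuration.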