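Each line in the .jsonl file is a single, self-contained JSON object. The _source field indicates which module the record came from: submissions, parallel, grammar, vocabulary, or logic. The // lines below are labels for this listing only; JSON has no comment syntax, so they do not appear in the file itself.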
// Text Submissions (pre-training)
{"text": "Naimbag a bigat...", "metadata": {"title": "...", "category": "Story"}, "_source": "submissions"}
// Parallel Sentences (translation)
{"ilokano": "Kumusta ka?", "english": "How are you?", "metadata": {"source": "..."}, "_source": "parallel"}
// Grammar Rules (instruction-tuning)
{"instruction": "Explain the Ilokano grammar rule: ...", "output": "...", "_source": "grammar"}
// Vocabulary (dictionary)
{"ilokano": "balay", "english": "house", "part_of_speech": "noun", "_source": "vocabulary"}
// Logic Entries (QA / dialog)
{"type": "qa", "question": "...", "answer": "...", "_source": "logic"}
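To work with a single module's records, the file can be read line by line and split on the _source field. The following is a minimal sketch, not part of the project itself: the function name `group_by_source` and the "unknown" fallback bucket are hypothetical, and skipping `//` lines is a defensive assumption in case the labels shown above are pasted along with the records.

```python
import json
from collections import defaultdict


def group_by_source(lines):
    """Group JSONL records by their _source field (hypothetical helper)."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Assumption: blank lines and // label lines are not records.
        if not line or line.startswith("//"):
            continue
        record = json.loads(line)
        # Records without a _source field go to an "unknown" bucket.
        groups[record.get("_source", "unknown")].append(record)
    return dict(groups)


# Sample records copied from the format listing above.
sample = [
    '{"ilokano": "Kumusta ka?", "english": "How are you?", "metadata": {"source": "..."}, "_source": "parallel"}',
    '{"ilokano": "balay", "english": "house", "part_of_speech": "noun", "_source": "vocabulary"}',
    '{"type": "qa", "question": "...", "answer": "...", "_source": "logic"}',
]

groups = group_by_source(sample)
print({source: len(records) for source, records in groups.items()})
# → {'parallel': 1, 'vocabulary': 1, 'logic': 1}
```

For a real export, an open file handle can be passed in place of `sample`, since iterating over a text file yields one line at a time.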