Massimo Roberto Scamarcia
mrs83
AI & ML interests
Natural Language Processing, Text Generation, Question Answering, Data Augmentation, Knowledge Transfer, Chain-of-Thought, ResearchOps, MLOps
Recent Activity
Updated a model about 4 hours ago: ethicalabs/Kurtis-EON1
Replied to their post about 4 hours ago:
In 2017, my RNNs were babbling. Today, they are hallucinating beautifully.
10 years ago, getting an LSTM to output coherent English was a struggle.
10 years later, after a "cure" based on FineWeb-EDU and a custom synthetic mix for causal conversation, the results are fascinating.
We trained this on ~10B tokens on a single AMD GPU (ROCm). It is not a Transformer: Echo-DSRN (400M) is a novel recurrent architecture inspired by Hymba, RWKV, and xLSTM, designed to challenge the "Attention is All You Need" monopoly on the Edge.
The ambitious goal is to build a small instruct model with RAG and tool-use capabilities (https://huggingface.co/ethicalabs/Kurtis-EON1).
📊 The Benchmarks (Size: 400M)
For a model this size (trained on <10B tokens), the specialized performance is surprising:
*SciQ*: 73.8% 🦄 (This rivals billion-parameter models in pure fact retrieval).
*PIQA*: 62.3% (Solid physical intuition for a sub-1B model).
The Reality Check:
HellaSwag (29.3%) and Winogrande (50.2%) show the limits of 400M parameters and a ~10B-token training budget.
We are hitting the "Reasoning Wall", which confirms we need to scale up to (hopefully) unlock deeper common sense. As you can see in the visualization (to be released soon on HF), the FineWeb-EDU bias is strong: the model is convinced it is in a classroom ("In this course, we explore...").
The Instruct Model is not ready yet, and we are currently using curriculum learning to test model plasticity.
Source code and weights will not be released yet. This is not a fork or a fine-tune: the base model is built in-house at https://www.ethicalabs.ai/, with novel components that do not exist in current open libraries.
🤝 Call for Collaboration: I am looking for Peer Reviewers interested in recurrent/hybrid architectures. If you want to explore what lies beyond Transformers, let’s connect!
Training diary: https://huggingface.co/ethicalabs/Kurtis-EON1
Replied to their post about 4 hours ago:
Hello HF community, I'm happy to share a project I've been working on that combines mlx-lm with Flower to enable federated fine-tuning of SLMs (Small Language Models) on macOS devices.
GitHub Repo: https://github.com/ethicalabs-ai/BlossomTuneLLM-MLX
By combining mlx-lm with a federated learning framework like Flower (https://flower.ai/), we can leverage the hardware people already own and reduce the reliance on expensive GPUs, enabling collaborative model training.
This project is the MLX-native evolution of an earlier codebase for FlowerTune LLM:
https://arxiv.org/abs/2506.02961
https://flower.ai/blog/2024-10-16-flowertune-llm-leaderboard
https://github.com/ethicalabs-ai/BlossomTuneLLM
How it works:
Flower handles all the federated learning logic.
A central server (superlink) coordinates the training rounds, client selection, and parameter aggregation.
Each participant in the network runs a Flower client (supernode) on their Mac. In each round, the client:
- Receives the global LoRA/DoRA adapter weights from the server.
- Loads its local data partition.
- Uses the mlx-lm programmatic API (mlx_lm.tuner.train) to perform LoRA/DoRA fine-tuning.
- Sends only the updated adapter weights back to the server.
The server only ever sees the aggregated model updates; private data never leaves the device (see the client sketch below).
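To make the round structure concrete, here is a minimal sketch of what such a client could look like with Flower's ClientApp API. This is an illustration under assumptions, not the project's actual code: `load_model_with_adapters`, `get_adapter_arrays`, `set_adapter_arrays`, and `run_local_training` are hypothetical helpers (the real implementation lives in the BlossomTuneLLM-MLX repo), and the exact mlx_lm.tuner.train signature can differ between mlx-lm versions.

```python
# Sketch of a Flower client wrapping mlx-lm LoRA/DoRA fine-tuning.
# `load_model_with_adapters`, `get_adapter_arrays`, `set_adapter_arrays`,
# and `run_local_training` are hypothetical helpers, not real mlx-lm or
# BlossomTuneLLM-MLX functions.
import numpy as np
from flwr.client import ClientApp, NumPyClient
from flwr.common import Context


class MLXLoRAClient(NumPyClient):
    def __init__(self, model, tokenizer, train_set, adapter_keys):
        self.model = model
        self.tokenizer = tokenizer
        self.train_set = train_set
        self.adapter_keys = adapter_keys  # ordered names of the LoRA/DoRA parameters

    def get_parameters(self, config):
        # Export only the adapter weights as NumPy arrays (hypothetical helper).
        return [np.asarray(w) for w in get_adapter_arrays(self.model, self.adapter_keys)]

    def fit(self, parameters, config):
        # 1. Load the global adapter weights received from the server.
        set_adapter_arrays(self.model, self.adapter_keys, parameters)
        # 2. Run local fine-tuning; in the real project this goes through the
        #    mlx-lm programmatic API (mlx_lm.tuner.train), whose signature
        #    varies between versions, so the call is kept abstract here.
        run_local_training(self.model, self.tokenizer, self.train_set, config)
        # 3. Send only the updated adapter weights back to the server.
        return self.get_parameters(config), len(self.train_set), {}


def client_fn(context: Context):
    # Flower injects a partition id when running supernodes or simulations.
    partition_id = context.node_config["partition-id"]
    model, tokenizer, train_set, keys = load_model_with_adapters(partition_id)
    return MLXLoRAClient(model, tokenizer, train_set, keys).to_client()


client_app = ClientApp(client_fn=client_fn)
```

The key design point is that only the adapter tensors ever cross the network: the frozen base model weights stay on each Mac.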
Flower made it easy to run a full simulation (with a centralized HF dataset, partitioned using flower-datasets) on a single machine or across multiple machines, to test the whole process in action and experiment further (see the simulation sketch below).
All you need is one or more Macs with Apple Silicon.
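For reference, a single-machine simulation along these lines could be wired up roughly as follows. This is a sketch under assumptions: the dataset name is only an example, `client_app` is the ClientApp from the sketch above, and the actual strategy and configuration live in the repository.

```python
# Sketch of a single-machine Flower simulation with a partitioned HF dataset.
# The dataset name is an example; `client_app` is the ClientApp from the
# previous sketch.
from flwr.common import Context
from flwr.server import ServerApp, ServerAppComponents, ServerConfig
from flwr.server.strategy import FedAvg
from flwr.simulation import run_simulation
from flwr_datasets import FederatedDataset
from flwr_datasets.partitioner import IidPartitioner

NUM_SUPERNODES = 4

# Partition a centralized Hugging Face dataset across the simulated clients.
fds = FederatedDataset(
    dataset="databricks/databricks-dolly-15k",  # example only
    partitioners={"train": IidPartitioner(num_partitions=NUM_SUPERNODES)},
)
# Each simulated client loads its shard via fds.load_partition(partition_id, "train").


def server_fn(context: Context):
    # FedAvg aggregates the LoRA/DoRA adapter weights returned by the clients.
    strategy = FedAvg(fraction_fit=1.0, min_available_clients=NUM_SUPERNODES)
    return ServerAppComponents(strategy=strategy, config=ServerConfig(num_rounds=3))


server_app = ServerApp(server_fn=server_fn)

run_simulation(
    server_app=server_app,
    client_app=client_app,  # from the client sketch above
    num_supernodes=NUM_SUPERNODES,
)
```

The same ServerApp/ClientApp pair can then be deployed for real: the server runs behind a Flower superlink and each Mac starts a supernode that connects to it, as described above.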
