I've started making pushes to include the missing pieces, so the Colab notebook will conform to the training regime and geovocab2 will no longer be required.
Most of the geovocab2-specific formulas and factories will be represented directly in the vocabulary directory, optimized beyond the originals. They will include both numpy and torch synthesis paths, along with numpy and torch optimizations for worker creation and transforms.
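To make the dual-backend idea concrete, here is a minimal sketch of what a synthesis routine with both numpy and torch paths might look like. The function name, signature, and polygon example are all my own invention for illustration; the actual vocabulary-directory factories are not shown in this post.

```python
import numpy as np
import torch

# Hypothetical dual-backend synthesizer; names and shapes are illustrative,
# not the actual vocabulary-directory code.
def synthesize_polygon(n_vertices, radius=1.0, backend="numpy"):
    """Generate the 2D vertices of a regular polygon on either backend."""
    if backend == "numpy":
        angles = np.linspace(0.0, 2.0 * np.pi, n_vertices, endpoint=False)
        return np.stack(
            [radius * np.cos(angles), radius * np.sin(angles)], axis=-1
        )
    elif backend == "torch":
        # torch.linspace always includes the endpoint, so generate one
        # extra angle and drop it to match the numpy path exactly.
        angles = torch.linspace(0.0, 2.0 * torch.pi, n_vertices + 1)[:-1]
        return torch.stack(
            [radius * torch.cos(angles), radius * torch.sin(angles)], dim=-1
        )
    raise ValueError(f"unknown backend: {backend}")
```

Keeping the two paths numerically identical is what lets the same vocabulary feed either a numpy-based worker pipeline or a torch training loop.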
Alongside this I will port the more robust shape factory from the original and expand it with deformation perturbation. Deformation will be a learned behavior of the model, allowing shapes to be deformed, aligned, and trained in bulk: multiple overlapping shapes, multiple sectorized shapes, sub-shapes, deviant shapes, and everything related to shape pooling, rather than relying on a hard-set spectrum of shapes projected into space.
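One plausible shape for a learned deformation perturbation is a small module that predicts a bounded per-vertex displacement. This is a sketch under my own assumptions, not the author's actual module; the class name, hidden size, and tanh bound are all invented here.

```python
import torch
import torch.nn as nn

class DeformationPerturbation(nn.Module):
    """Illustrative learned deformation: predicts a small, bounded
    per-vertex displacement from the vertex coordinates themselves.
    This architecture is an assumption, not the post's actual code."""

    def __init__(self, dim=2, hidden=32, scale=0.1):
        super().__init__()
        self.scale = scale  # hard bound on displacement magnitude
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, dim),
        )

    def forward(self, verts):
        # verts: (..., n_vertices, dim); tanh keeps each displacement
        # component within [-scale, scale] so the shape deforms, never tears.
        return verts + self.scale * torch.tanh(self.net(verts))
```

Because the displacement is a differentiable function of the vertices, the deformation can be trained end-to-end in bulk alongside the overlapping and sectorized shape batches described above.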
These patches are essentially alignment sectorization in its first state, targeting the first 8-piece prototype of the chunk, which I can train on the G4 currently issued by Colab.
This is a required element for bringing the learner to full definition capacity, and a necessary hurdle before the patchwork can be expanded to a full chunk. The experiments leading to this point are promising, and as I snap together pieces from the successful experiments, the system should begin to converge exactly where expected.
After that, it's a matter of expanding upward to the necessary architecture and introducing the weights via sequential linear interpolative sequencing, something transformers are uniquely capable of handling with minimal computation once the pre-calculations are done.
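"Sequential linear interpolative sequencing" is the author's own term; one simple reading of it is plain linear interpolation between successive pre-computed weight snapshots, sketched below. This interpretation is an assumption, and `lerp_weights` is a hypothetical helper, not part of the project.

```python
import torch

def lerp_weights(state_a, state_b, t):
    """Linearly interpolate two state dicts: (1 - t) * a + t * b.
    One possible reading of the post's interpolative sequencing;
    the actual scheme may differ."""
    return {k: torch.lerp(state_a[k], state_b[k], t) for k in state_a}
```

Stepping `t` through a sequence of values walks the weights from one pre-computed state to the next, which is cheap at load time since only elementwise lerps remain after the pre-calculations.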
So far so good.
I'll be running multiple alucard fusion ablations on the patchwork before defaulting to the dual-stream slit-light superposition crystal topology architecture that I've already proven works for the smaller patchmaker. My hope is to approximate that behavior in a more concise way without requiring the full spread of geometric globalization, but there are no guarantees yet. If it works, it could save a huge chunk of training time, and alucard's internal scheduling step system will have a place. It may also cut a large percentage of the follow-up training, potentially allowing training on fewer machines. The topology architecture may still turn out to be fully required, so hopefully I can avoid it all with some clever math and be done with it.
Avoiding the full multi-tower Beatrix oscillation system would be absolutely fantastic, but I suspect the predictions it affords may be fully required, and the oscillation system will likely need to be retuned into a new form for this use case as well.