
npub12262qa4uhw7u8gdwlgmntqtv7aye8vdcmvszkqwgs0zchel6mz7s6cgrkj
hex: 73654644dbb32a525757d2d20258431f3437af9301e8403fc097138e75e0a9ad
nevent: nevent1qqs8xe2xgndmx2jj2ata95sztpp37dph47fsr6zq8lqfwyuwwhs2ntgprpmhxue69uhhyetvv9ujuem4d36kwatvw5hx6mm9qgs99d9qw67th0wr5xh05de4s9k0wjvnkxudkgptq8yg83vtulad30gzxqcuv
Kind 1 (Text Note)
↳ Reply · event not found
4a5ed84f10926d05c515d29150b5a0166abd2b0990bc4fad4cfc83aba1ed92fd...
they are effectively the same thing, one just has more tunability.

A basic dot-product-based classifier could be a classification head just by setting each output's weights to the corresponding class embedding.
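A minimal PyTorch sketch of that equivalence (the dimensions and the class_embeds tensor are illustrative assumptions, not anything from the post): a bias-free linear head whose weight rows are the class embeddings produces exactly the dot-product scores.

```python
import torch
import torch.nn as nn

d, n_classes = 768, 4                       # illustrative dimensions
class_embeds = torch.randn(n_classes, d)    # one (hypothetical) embedding per class

# Dot-product classifier: score_c = <x, e_c>
def dot_product_scores(x):
    return x @ class_embeds.T

# The same thing as a classification head: a bias-free linear layer
# whose weight rows are set to the class embeddings.
head = nn.Linear(d, n_classes, bias=False)
with torch.no_grad():
    head.weight.copy_(class_embeds)

x = torch.randn(2, d)                       # a batch of encoder outputs
assert torch.allclose(dot_product_scores(x), head(x), atol=1e-6)
```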
I think a gated MLP may work better though, and would allow a lower LR for the main weights, which could reduce OOD shift. Compared to a direct dot-product head it would allow some nonlinearity and a higher-dimensional intermediate representation.
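A sketch of what a gated MLP head with a lower base-model LR might look like (the module layout, hidden size, and learning rates are my assumptions): the gate adds nonlinearity, the hidden layer gives a wider intermediate representation, and separate optimizer param groups keep the backbone on a smaller LR.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMLPHead(nn.Module):
    """Gated MLP classification head: nonlinearity plus a wider
    intermediate representation than a plain dot-product head."""
    def __init__(self, d_model: int, n_classes: int, d_hidden: int = 2048):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.gate = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, n_classes)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

# Stand-in for a small (<1B) base encoder producing d_model-dim features.
base = nn.Sequential(nn.Linear(128, 768), nn.GELU())
head = GatedMLPHead(d_model=768, n_classes=4)

# Lower LR on the base weights, higher on the head, to limit how far
# the backbone drifts from its pretrained (in-distribution) behaviour.
optimizer = torch.optim.AdamW([
    {"params": base.parameters(), "lr": 1e-5},
    {"params": head.parameters(), "lr": 1e-3},
])
```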
Could also intentionally bottleneck training by using LoRA instead of full FT on the base model (but usually not worth it considering these are <1B-param models).
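And a minimal hand-rolled sketch of the LoRA-as-bottleneck idea (rank, scaling, and layer sizes are arbitrary picks; a real setup would more likely wrap layers with a library such as peft): the full weight is frozen and only a low-rank update trains.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear plus a low-rank update: W x + (alpha/r) * B A x.
    Only A and B train, which bottlenecks how far the layer can move."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # freeze the full weights (no full FT)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # only the low-rank A/B parameters, a small fraction of 768*768
```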
Raw JSON
{
"kind": 1,
"id": "73654644dbb32a525757d2d20258431f3437af9301e8403fc097138e75e0a9ad",
"pubkey": "52b4a076bcbbbdc3a1aefa3735816cf74993b1b8db202b01c883c58be7fad8bd",
"created_at": 1776977416,
"tags": [
[
"e",
"bb2b59a5ec8683b22037fe5353dc7511e026ae150c669c4fda770da727da2aee",
"wss://relay.primal.net/",
"root",
"5cdbde0a550fc046a38d75e9fc238094f96b0165b8297c4fa69134ae4ec80024"
],
[
"e",
"4a5ed84f10926d05c515d29150b5a0166abd2b0990bc4fad4cfc83aba1ed92fd",
"wss://nos.lol",
"reply",
"5cdbde0a550fc046a38d75e9fc238094f96b0165b8297c4fa69134ae4ec80024"
],
[
"p",
"5cdbde0a550fc046a38d75e9fc238094f96b0165b8297c4fa69134ae4ec80024"
]
],
"content": "they are effectively the same thing, one just has more tunability.\n\na basic dot-product based classifier could be a classification head just by setting each output’s weights to the embed.\n\nI think a gated MLP may work better though, and would allow a lower LR for the main weights that could reduce OOD shift. Compared to a direct one it would allow some nonlinearity and a higher intermediate representation\n\nCould also intentionally bottleneck training by having LoRA instead of full FT on the base model (but usually not worth it when considering these are \u003c1B param)",
"sig": "acde57f98dd4905ca175dd35191f9fce7db2fe0b4168f711bf704e1fc696ce6e1bd0d088560dea41dc0ff0de16feee9e46f6fb35ea7a1daf072e077852bb55a5"
}