
npub12262qa4uhw7u8gdwlgmntqtv7aye8vdcmvszkqwgs0zchel6mz7s6cgrkj
hex: 73654644dbb32a525757d2d20258431f3437af9301e8403fc097138e75e0a9ad
nevent: nevent1qqs8xe2xgndmx2jj2ata95sztpp37dph47fsr6zq8lqfwyuwwhs2ntgprpmhxue69uhhyetvv9ujuem4d36kwatvw5hx6mm9qgs99d9qw67th0wr5xh05de4s9k0wjvnkxudkgptq8yg83vtulad30gzxqcuv
Kind 1 (Text Note)
↳ Reply · event not found
4a5ed84f10926d05c515d29150b5a0166abd2b0990bc4fad4cfc83aba1ed92fd...
they are effectively the same thing, one just has more tunability.

A basic dot-product-based classifier could be a classification head just by setting each output's weights to the corresponding class embedding.
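A minimal PyTorch sketch of that equivalence (the dimensions and the class_embeds tensor are illustrative assumptions, not anything from the post): a bias-free linear head whose weight rows are the class embeddings produces exactly the dot-product scores.

```python
import torch
import torch.nn as nn

d, n_classes = 768, 4                       # illustrative dimensions
class_embeds = torch.randn(n_classes, d)    # one (hypothetical) embedding per class

# Dot-product classifier: score_c = <x, e_c>
def dot_product_scores(x):
    return x @ class_embeds.T

# The same thing as a classification head: a bias-free linear layer
# whose weight rows are set to the class embeddings.
head = nn.Linear(d, n_classes, bias=False)
with torch.no_grad():
    head.weight.copy_(class_embeds)

x = torch.randn(2, d)                       # a batch of encoder outputs
assert torch.allclose(dot_product_scores(x), head(x), atol=1e-6)
```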
I think a gated MLP may work better though, and would allow a lower LR for the main weights, which could reduce OOD shift. Compared to a direct dot-product head it would allow some nonlinearity and a higher-dimensional intermediate representation.
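A sketch of what a gated MLP head with a lower base-model LR might look like (the module layout, hidden size, and learning rates are my assumptions): the gate adds nonlinearity, the hidden layer gives a wider intermediate representation, and separate optimizer param groups keep the backbone on a smaller LR.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMLPHead(nn.Module):
    """Gated MLP classification head: nonlinearity plus a wider
    intermediate representation than a plain dot-product head."""
    def __init__(self, d_model: int, n_classes: int, d_hidden: int = 2048):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.gate = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, n_classes)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

# Stand-in for a small (<1B) base encoder producing d_model-dim features.
base = nn.Sequential(nn.Linear(128, 768), nn.GELU())
head = GatedMLPHead(d_model=768, n_classes=4)

# Lower LR on the base weights, higher on the head, to limit how far
# the backbone drifts from its pretrained (in-distribution) behaviour.
optimizer = torch.optim.AdamW([
    {"params": base.parameters(), "lr": 1e-5},
    {"params": head.parameters(), "lr": 1e-3},
])
```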
Could also intentionally bottleneck training by using LoRA instead of full FT on the base model (but usually not worth it considering these are <1B-param models).
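And a minimal hand-rolled sketch of the LoRA-as-bottleneck idea (rank, scaling, and layer sizes are arbitrary picks; a real setup would more likely wrap layers with a library such as peft): the full weight is frozen and only a low-rank update trains.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear plus a low-rank update: W x + (alpha/r) * B A x.
    Only A and B train, which bottlenecks how far the layer can move."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # freeze the full weights (no full FT)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # only the low-rank A/B parameters, a small fraction of 768*768
```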
Raw JSON
{
"kind": 1,
"id": "73654644dbb32a525757d2d20258431f3437af9301e8403fc097138e75e0a9ad",
"pubkey": "52b4a076bcbbbdc3a1aefa3735816cf74993b1b8db202b01c883c58be7fad8bd",
"created_at": 1776977416,
"tags": [
[
"e",
"bb2b59a5ec8683b22037fe5353dc7511e026ae150c669c4fda770da727da2aee",
"wss://relay.primal.net/",
"root",
"5cdbde0a550fc046a38d75e9fc238094f96b0165b8297c4fa69134ae4ec80024"
],
[
"e",
"4a5ed84f10926d05c515d29150b5a0166abd2b0990bc4fad4cfc83aba1ed92fd",
"wss://nos.lol",
"reply",
"5cdbde0a550fc046a38d75e9fc238094f96b0165b8297c4fa69134ae4ec80024"
],
[
"p",
"5cdbde0a550fc046a38d75e9fc238094f96b0165b8297c4fa69134ae4ec80024"
]
],
"content": "they are effectively the same thing, one just has more tunability.\n\na basic dot-product based classifier could be a classification head just by setting each output’s weights to the embed.\n\nI think a gated MLP may work better though, and would allow a lower LR for the main weights that could reduce OOD shift. Compared to a direct one it would allow some nonlinearity and a higher intermediate representation\n\nCould also intentionally bottleneck training by having LoRA instead of full FT on the base model (but usually not worth it when considering these are \u003c1B param)",
"sig": "acde57f98dd4905ca175dd35191f9fce7db2fe0b4168f711bf704e1fc696ce6e1bd0d088560dea41dc0ff0de16feee9e46f6fb35ea7a1daf072e077852bb55a5"
}