https://huggingface.co/datasets/TIGER-Lab/WebInstructSub
A QA dataset, and it looks like the model trained on it got a modest boost in reasoning ability.

https://tiger-ai-lab.github.io/MAmmoTH2/
The explanation is at the link above.
The answers being longer than expected is a bit of a drawback, but for now...
https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1
NVIDIA SFT data, but access approval is taking far too long.....
https://huggingface.co/datasets/allenai/ai2_arc
It's QA, but multiple-choice QA (MCQA)....
Subsets: ARC-Challenge and ARC-Easy
Splits: train, validation, and test
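Since the data is multiple choice, each record has to be flattened into a prompt before it is useful for QA-style training. Below is a minimal sketch, assuming the record layout on the dataset page (`question`, a `choices` dict with parallel `text` and `label` lists, and `answerKey`). The helper name `format_arc_prompt` and the sample record are hypothetical, and the commented-out loading code assumes the Hugging Face `datasets` library is installed.

```python
def format_arc_prompt(example):
    """Render one ARC-style record as a multiple-choice prompt string."""
    lines = [f"Question: {example['question']}"]
    choices = example["choices"]
    for label, text in zip(choices["label"], choices["text"]):
        lines.append(f"{label}. {text}")
    lines.append("Answer:")
    return "\n".join(lines)


# Hypothetical record that mimics the ARC layout (not taken from the dataset).
sample = {
    "question": "Which unit measures force?",
    "choices": {
        "text": ["gram", "liter", "foot", "newton"],
        "label": ["A", "B", "C", "D"],
    },
    "answerKey": "D",
}

arc_prompt = format_arc_prompt(sample)
print(arc_prompt)

# With `datasets` installed, real records could be loaded roughly like this:
# from datasets import load_dataset
# arc = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="train")
# print(format_arc_prompt(arc[0]))
```

The target for training or scoring would then be the `answerKey` label, or the matching choice text if a free-form answer is preferred.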

https://huggingface.co/datasets/yahma/alpaca-cleaned
SFT data
A cleaned version of the Alpaca dataset; only a train split exists
Use only the examples whose input is empty
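The empty-input filter could look roughly like the sketch below. It assumes the usual Alpaca fields (`instruction`, `input`, `output`); the helper `has_empty_input` and the toy records are hypothetical. Treating whitespace-only inputs as empty is my own choice here, not something the dataset requires.

```python
def has_empty_input(example):
    """True when the Alpaca-style `input` field is missing or blank."""
    return not (example.get("input") or "").strip()


# Toy records for illustration only (not taken from the dataset).
records = [
    {"instruction": "Name a primary color.", "input": "", "output": "Red."},
    {"instruction": "Summarize the text.", "input": "Some passage.", "output": "A summary."},
    {"instruction": "Say hello.", "input": "   ", "output": "Hello!"},
]

kept = [r for r in records if has_empty_input(r)]
print(len(kept), "of", len(records), "records kept")

# With `datasets` installed, the same predicate could be passed to Dataset.filter:
# from datasets import load_dataset
# alpaca = load_dataset("yahma/alpaca-cleaned", split="train").filter(has_empty_input)
```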

https://huggingface.co/datasets/databricks/databricks-dolly-15k
Using only the examples with an empty context should work

https://huggingface.co/datasets/nvidia/ChatQA-Training-Data
Using only the sft subset should be enough
It's already parsed, though, so that needs to be handled carefully

https://huggingface.co/datasets/rajpurkar/squad
It could probably be used by putting each example into a Context + Question format
Hmm...
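One way the Context + Question formatting might look, as a rough sketch. It assumes the SQuAD fields `context`, `question`, and `answers` (with parallel `text` and `answer_start` lists). The helper names, the prompt template, and the sample record are all hypothetical.

```python
def format_squad_prompt(example):
    """Build a Context/Question prompt from a SQuAD-style record."""
    context = example["context"].strip()
    question = example["question"].strip()
    return f"Context: {context}\nQuestion: {question}\nAnswer:"


def first_answer(example):
    """Return the first reference answer, or an empty string if none exists."""
    texts = example["answers"]["text"]
    return texts[0] if texts else ""


# Hypothetical record that mimics the SQuAD layout (not taken from the dataset).
sample = {
    "context": "The library opened in 1917.",
    "question": "When did the library open?",
    "answers": {"text": ["1917"], "answer_start": [22]},
}

squad_prompt = format_squad_prompt(sample)
squad_target = first_answer(sample)
print(squad_prompt)
print(squad_target)
```

Taking only the first reference answer is a simplification; records with several references might need different handling.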

https://huggingface.co/datasets/tau/commonsense_qa
The splits include validation and test, so this should work for evaluation as well
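For the evaluation side, a minimal accuracy sketch is below. It assumes each record carries the gold label in `answerKey` and that the model's predictions are already reduced to choice labels. The function name, the toy records, and the predictions are hypothetical.

```python
def accuracy(predictions, examples):
    """Fraction of examples whose predicted label equals `answerKey`."""
    if not examples:
        return 0.0
    if len(predictions) != len(examples):
        raise ValueError("predictions and examples must have the same length")
    correct = sum(pred == ex["answerKey"] for pred, ex in zip(predictions, examples))
    return correct / len(examples)


# Toy validation records for illustration only (not taken from the dataset).
validation = [
    {"question": "q1", "answerKey": "A"},
    {"question": "q2", "answerKey": "B"},
    {"question": "q3", "answerKey": "E"},
    {"question": "q4", "answerKey": "B"},
]
predictions = ["A", "C", "E", "B"]  # hypothetical model outputs

score = accuracy(predictions, validation)
print(f"accuracy = {score:.2f}")
```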

Law
https://huggingface.co/datasets/dzunggg/legal-qa-v1
A legal QA dataset, but some of the questions are extremely long, so that will need to be dealt with
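One simple way to deal with the long questions is to set the overly long ones aside by word count, as sketched below. The field name `question` and the `max_words` threshold are assumptions I have not checked against the dataset, so both are left as parameters. The helper name and toy records are hypothetical.

```python
def split_by_question_length(examples, max_words=200, field="question"):
    """Separate records into (short, long) lists by whitespace word count."""
    short, long_ = [], []
    for ex in examples:
        n_words = len(ex[field].split())
        (short if n_words <= max_words else long_).append(ex)
    return short, long_


# Toy records for illustration only (not taken from the dataset).
examples = [
    {"question": "Is this contract valid?", "answer": "It depends."},
    {"question": " ".join(["word"] * 300), "answer": "Too long to keep."},
]

short, long_ = split_by_question_length(examples, max_words=200)
print(len(short), "kept,", len(long_), "set aside")
```

Keeping the long records in a separate list, rather than dropping them, leaves room to truncate or summarize them later.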

Medical