Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, pp. 1877-1901.
Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. (2021). Measuring massive multitask language understanding. International Conference on Learning Representations.
Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. (2021). Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.
Jin, D., Pan, E., Oufattole, N., Weng, W., Fang, H., and Szolovits, P. (2021). What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14), 6421.
Li, Z., Peng, J., Yang, Z., Li, B., Zhang, Y., Zhang, X., Mi, F., Zhang, Y., Xu, J., and Sun, M. (2023). On the evaluation of AI-generated texts. arXiv preprint arXiv:2309.12288.
Lin, S., Hilton, J., and Evans, O. (2022). TruthfulQA: Measuring how models mimic human falsehoods. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3214-3252.
Nakamoto, S. (2008). Bitcoin: A peer-to-peer electronic cash system. White paper.
Pal, A., Umapathi, L. K., and Sankarasubbu, M. (2022). MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering. arXiv preprint arXiv:2203.14371.
Perez, E., Kiela, D., and Cho, K. (2022). Discovering language model behaviors with model-written evaluations. arXiv preprint arXiv:2212.09251.
Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. (2019). SuperGLUE: A stickier benchmark for general-purpose language understanding systems. Advances in Neural Information Processing Systems, 32.
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. (2018). GLUE: A multi-task benchmark and analysis platform for natural language understanding. Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 353-355.
Wang, K., Guo, Q., Zhang, S., Zhou, X., Li, Y., and Yuan, Q. (2023). Evaluating large language models on blockchain technical knowledge. arXiv preprint arXiv:2306.12564.
Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., and Hajishirzi, H. (2023). Self-Instruct: Aligning language models with self-generated instructions. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13484-13508.
Huang, J., Gu, S. S., Hou, L., Wu, Y., Wang, X., Yu, H., and Han, J. (2022). Large language models can self-improve. arXiv preprint arXiv:2210.11610.
Xie, Z., Lin, G., Shi, Y., Li, Z., and Wen, Q. (2023). FinBen: A benchmark for financial language understanding. arXiv preprint arXiv:2312.09230.
Zheng, C., Xiong, L., Ding, Z., Li, T., and Zhuang, X. (2023). LawBench: Benchmarking legal knowledge of large language models. arXiv preprint arXiv:2309.16289.