Keyword Extraction of Scientific Journal Abstracts Using TF-IDF and KeyBERT Methods

Authors

  • Rayvin Suhartoyo Multi Data Palembang University
  • Valen Julyo Multi Data Palembang University
  • Hafiz Irsyad Multi Data Palembang University
  • Abdul Rahman Multi Data Palembang University
https://doi.org/10.58466/aicoms.v4i2.1814

Keywords:

Cosine Similarity, Information Retrieval, Keyword Extraction, KeyBERT, Semantic Similarity, TF-IDF

Abstract

Keyword extraction is an important technique in natural language processing (NLP) that condenses the essence of a document, such as the abstract of a scientific journal article. This study analyzes the effectiveness of two keyword extraction methods, Term Frequency-Inverse Document Frequency (TF-IDF) and KeyBERT, in identifying significant keywords from a collection of scientific journal abstracts. The dataset consists of scientific journal abstracts accompanied by manually assigned keywords, which serve as the basis for assessment. TF-IDF scores a word by how often it appears in a document, weighted against how common it is across the collection, while KeyBERT uses cosine similarity between embeddings from a BERT transformer model to select the most meaningful keywords. The findings show that both methods achieve a moderate level of similarity to the manual keywords, with semantic similarity values of 0.578 for KeyBERT and 0.469 for TF-IDF. These results suggest that both methods, combined with machine learning and deep learning models, have considerable potential for topic classification systems, particularly in information retrieval and text mining.
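
The contrast described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (tokenize, tfidf_keywords, cosine_similarity), the smoothed IDF formula, and the toy documents are all assumptions made for illustration. The TF-IDF part ranks words by frequency weighted against the collection; the cosine function is the comparison step that an embedding-based method such as KeyBERT would apply to transformer embeddings, which are not computed here.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphabetic tokens only (assumed preprocessing;
    # a real pipeline would likely also remove stop words).
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return cleaned.split()


def tfidf_keywords(docs, doc_index, top_n=3):
    # Rank the words of one document by TF-IDF against the whole collection.
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    tf = Counter(tokenized[doc_index])
    total = sum(tf.values())

    # Smoothed IDF (an assumed variant; the abstract does not specify one).
    scores = {
        term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in tf.items()
    }
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


def cosine_similarity(u, v):
    # Cosine of the angle between two equal-length numeric vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy collection (hypothetical data, not from the study).
docs = [
    "keyword extraction with bert embeddings",
    "image classification with neural networks",
    "keyword ranking with tfidf weights",
]
print(tfidf_keywords(docs, 0))
print(cosine_similarity([1.0, 1.0], [1.0, 0.0]))
```

In this sketch, words shared across many documents (such as "with") receive a lower weight than words unique to one document. An embedding-based extractor would instead encode the document and each candidate phrase as vectors and keep the candidates whose cosine similarity to the document vector is highest.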

Published

2025-11-13

How to Cite

Suhartoyo, R., Julyo Armando Davincy Lin, V., Irsyad, H., & Rahman, A. (2025). Keyword Extraction of Scientific Journal Abstracts Using TF-IDF and KeyBERT Methods. Applied Information Technology and Computer Science (AICOMS), 4(2), 01-09. https://doi.org/10.58466/aicoms.v4i2.1814

Section

Articles