Website Crawler Scheduling Using the Mining Data Record and Exponential Smoothing Methods

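Cahyono, Warna Agung (2018) Penjadwalan Crawler Website Menggunakan Metode Mining Data Record Dan Exponensial Smoothin [Website Crawler Scheduling Using the Mining Data Record and Exponential Smoothing Methods]. Magister (Master's) thesis, Universitas Brawijaya.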

Abstract

Research in Competitive Intelligence and research in Web Crawling are mutually beneficial. In today's information age, websites are the main source of information. This research focuses on how to obtain data from websites and how to slow down the download intensity. The first problem is that source websites are autonomous, so the structure of their content may change at any time. The second problem is that the Snort intrusion detection system installed on the server detects crawler bots, so that the crawler's IP address and session are blocked. The researcher proposes crawling with the Mining Data Record method for information retrieval and the Exponential Smoothing method for scheduling fetches/downloads, so that the crawler adapts to content structures that change at any time, and to deceive the source website by making the browse or fetch schedule automatically follow typical human patterns. Information retrieval begins with fetching/downloading the HTML document, followed by construction of the tag tree and identification of data records by MDR (Mining Data Record). Once the data regions have been formed, the data records of each data region are split, and records with close similarity are regrouped by STM (Simple Tree Matching). The process ends with archiving and aligning the pattern of each data region using DEPTA (Data Extraction Partial Tree Alignment). The download time of every data record in each data region is then converted into a data series. The data series is used to forecast the number of data records in interval t+1. The forecast is then converted back into the time domain to estimate when the next fetch/download should be scheduled, after which the process returns to information retrieval.
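The STM regrouping step can be illustrated with a minimal sketch of Simple Tree Matching. The `Node` class, the function names, and the normalization by the average tree size are assumptions made for this illustration (the normalization is a common choice in the DEPTA literature); the thesis implementation may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tag-tree node: an HTML tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def tree_size(node: Node) -> int:
    """Count the nodes in the subtree rooted at `node`."""
    return 1 + sum(tree_size(child) for child in node.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the size of the maximum matching between two ordered trees."""
    if a.tag != b.tag:
        return 0  # roots differ, so nothing below them is matched
    m, n = len(a.children), len(b.children)
    # table[i][j]: best matching between the first i children of a
    # and the first j children of b (order-preserving)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = max(
                table[i][j - 1],
                table[i - 1][j],
                table[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return table[m][n] + 1  # +1 for the matched roots


def normalized_similarity(a: Node, b: Node) -> float:
    """Assumed normalization: matching size over the average tree size."""
    return simple_tree_matching(a, b) / ((tree_size(a) + tree_size(b)) / 2)


record_a = Node("div", [Node("h3"), Node("p"), Node("a")])
record_b = Node("div", [Node("h3"), Node("p"), Node("span")])
score = normalized_similarity(record_a, record_b)  # 3 matched nodes / 4 = 0.75
```

With a similarity threshold of 0.65, these two example records would be placed in the same group.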
Testing was carried out on 6 websites and covered three aspects: first, how valid the crawling stage is, measured by recall, precision, and F-measure; second, a comparison of the number of duplicate data records; and third, a comparison of the number of missed (lost) data records. With a Levenshtein edit distance threshold of 0.3 for MDR and a similarity score threshold of 0.65 for STM, the tests yielded an average recall of 92.6%, a precision of 100%, and an average F-measure of 96.4%. The exponential smoothing estimation test with α = 0.5 produced an MAE of 17.7 duplicate data records, a decrease of 4.1 duplicate data records from the MAE of 21.8 obtained with a fixed fetch schedule. The decrease in the number of duplicate data records means that the fetch schedule was delayed/slowed down.
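As a rough illustration of the scheduling idea, the sketch below applies simple exponential smoothing with α = 0.5 to a series of per-interval record counts and computes a mean absolute error. The function names, the sample series, and especially the `next_fetch_delay` rule for mapping the forecast back to a waiting time are assumptions made for this example; the abstract does not specify how the thesis performs that conversion.

```python
def exponential_smoothing_forecast(series, alpha=0.5):
    """Return the one-step-ahead forecast for interval t+1."""
    if not series:
        raise ValueError("series must contain at least one observation")
    smoothed = float(series[0])  # initialise with the first observation
    for observed in series[1:]:
        # s_t = alpha * x_t + (1 - alpha) * s_{t-1}
        smoothed = alpha * observed + (1 - alpha) * smoothed
    return smoothed


def next_fetch_delay(forecast, interval_seconds, target_records, max_delay_seconds):
    """Hypothetical rule: wait until roughly `target_records` new records
    are expected, capped at `max_delay_seconds`."""
    if forecast <= 0:
        return float(max_delay_seconds)  # nothing expected, wait the maximum
    return min(float(max_delay_seconds),
               interval_seconds * target_records / forecast)


def mean_absolute_error(actual, predicted):
    """Average absolute difference between paired values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


records_per_hour = [4, 8, 6]  # made-up counts of new data records per hour
forecast = exponential_smoothing_forecast(records_per_hour, alpha=0.5)  # 6.0
delay = next_fetch_delay(forecast, 3600, 12, 86400)  # 7200.0 seconds
```

A lower forecast lengthens the delay, which matches the abstract's reading that fewer duplicate records correspond to a slower fetch schedule.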

English Abstract

Studies in Competitive Intelligence and studies in Web Crawling have a mutually beneficial relationship. In today's information age, websites serve as the main source of information. This research focuses on how to obtain data from websites and how to slow down the intensity of downloads. The first problem is that source websites are autonomous, so the structure of their content can change at any time. The second problem is the Snort intrusion detection system installed on the server to detect crawler bots. We propose combining the Mining Data Record (MDR) method for information retrieval with the Exponential Smoothing method for scheduling the fetcher/downloader. The aim is for the scraper to adapt to changes in the structure of the content, and for browsing or fetching to automatically follow the pattern in which news items appear. Information retrieval starts with fetching/downloading the HTML content from the target server, followed by building the tag tree, identifying data records with MDR, regrouping the new data regions with STM (Simple Tree Matching), and finally updating the pattern of each data region by alignment using DEPTA (Data Extraction Partial Tree Alignment). Scheduling the fetcher starts with converting the download time of each data record into a data series. ES (Exponential Smoothing) then uses the data series to estimate the number of data records in the next interval t+1. The estimate is converted back into the time domain to determine when the fetcher/downloader will run next, and the process then returns to information retrieval. In the tests, with a threshold of 0.3 for MDR and a similarity score threshold of 0.65 for STM, the recall and precision values yield an average F-measure of 96.4%. The exponential smoothing estimation test with α = 0.5 produces an MAE of 17.7 duplicate data records, down by 4.1 from the MAE of 21.8 duplicate data records obtained with a fixed download/fetch schedule based on the average time between news occurrences.
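The 0.3 threshold for MDR refers to an edit distance between tag sequences. The following is a minimal sketch of a length-normalized Levenshtein comparison; normalizing by the longer sequence and the function names are assumptions for illustration, not details confirmed by the abstract.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of tag names)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (x != y),  # substitution, free if equal
            ))
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Assumed normalization: distance divided by the longer length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def is_similar(a, b, threshold=0.3):
    """Treat two tag sequences as similar if they are within the threshold."""
    return normalized_distance(a, b) <= threshold


tags_a = ["tr", "td", "a", "td", "img"]
tags_b = ["tr", "td", "a", "td", "span"]
similar = is_similar(tags_a, tags_b)  # 1 edit over 5 tags = 0.2, so True
```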

Item Type:	Thesis (Magister)
Identification Number:	TES/025.042 2/CAH/p/2018/041809616
Uncontrolled Keywords:	MDR, STM, DEPTA, ES
Subjects:	000 Computer science, information and general works > 025 Operations of libraries, archives, information centers > 025.04 Information storage and retrieval systems > 025.042 World Wide Web > 025.042 2 Web sites
Divisions:	S2/S3 > Magister Teknik Elektro, Fakultas Teknik
Depositing User:	Endang Susworini
Date Deposited:	22 Jul 2022 04:30
Last Modified:	22 Jul 2022 04:30
URI:	http://repository.ub.ac.id/id/eprint/192530
