Logstash: import RSS feed data

Posted by holowugz on Sun, 30 Jan 2022 18:46:39 +0100

In the actual use process, we sometimes want to import and search RSS data. In many real micro services, a lot of data is provided in the form of RSS feed, such as our common comment websites. So is there any way to import these data into Elasticsearch and search? The answer is to use the information provided by Logstash RSS input plugin . In today's article, I will use an example to show.

We first find an RSS feed. We can see such a page on Elastic's official website elastic.co/blog . We open this page:

Click RSS above:

Above, we can see the response result of RSS. It contains title, description and other information. We can use RSS input plugin to extract these.

 

install

Since RSS input plugin is not one of Logstash input plugin s, we must install it manually. If you want to know the input plugin of Logstash, you can use the following command:

./bin/logstash-plugin list --group input

The above command shows:

logstash-input-azure_event_hubs
logstash-input-beats
└── logstash-input-elastic_agent (alias)
logstash-input-couchdb_changes
logstash-input-elasticsearch
logstash-input-exec
logstash-input-file
logstash-input-ganglia
logstash-input-gelf
logstash-input-generator
logstash-input-graphite
logstash-input-heartbeat
logstash-input-http
logstash-input-http_poller
logstash-input-imap
logstash-input-jms
logstash-input-pipe
logstash-input-redis
logstash-input-rss
logstash-input-s3
logstash-input-snmp
logstash-input-snmptrap
logstash-input-sqs
logstash-input-stdin
logstash-input-syslog
logstash-input-tcp
logstash-input-twitter
logstash-input-udp
logstash-input-unix

The RSS input plugin is obviously not in it. We can use the following command to install:

./bin/logstash-plugin install logstash-input-rss
$ ./bin/logstash-plugin install logstash-input-rss
Using JAVA_HOME defined java: /Library/Java/JavaVirtualMachines/jdk-15.0.2.jdk/Contents/Home
WARNING, using JAVA_HOME while Logstash distribution comes with a bundled JDK
Validating logstash-input-rss
Installing logstash-input-rss
Installation successful

Once the installation is successful, we will use the above list command again to check, and we will find that logstash input RSS will be included in it.

 

Import RSS feed data

Next, we use the RSS input plugin installed above to import https://www.elastic.co/blog/feed Data in. First, we create the following configuration files in the root directory of Logstash installation:

logstash.conf

input {
  rss {
    url => "https://www.elastic.co/blog/feed"
    interval => 120
    tags => ["rss", "elastic"]
  }
}

filter { 
  mutate { 
    rename => [ "message", "blog_html" ] 
    copy => { "blog_html" => "blog_text" } 
    copy => { "published" => "@timestamp" } 
  } 
  mutate { 
    gsub => [  
      "blog_text", "<.*?>", "", 
      "blog_text", "[\n\t]", " " 
    ] 
    remove_field => [ "published", "author" ] 
  }
} 
 
output {
  stdout {
    codec => rubydebug
  }

  elasticsearch { 
    hosts => [ "localhost:9200" ] 
    index => "elastic_blog" 
  }
}

In the input section above, we defined rss. Its url is https://www.elastic.co/blog/feed . The interval is 120, that is, 120 seconds. Run every two minutes. At the same time, we added the tags we want to facilitate our data search. In the filter section, we rename message to blog_html, and the fields are copied at the same time. In the change part, we replace all the contents in < > brackets with "", that is, remove them. At the same time, it also replaces \ n\t such characters as null "". Finally, we delete the published and author fields, although it is unnecessary to delete the author field.

Run the following command in the Logstash installation directory:

./bin/logstash -f logstash.conf

We can see in the console:

We can see in the blog_text, those contents in < > tag have been removed.

We can find the index elastic in elastic search_ blog:

GET elastic_blog/_count
{
  "count" : 20,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}

We can even search it:

 

After another two minutes, we will review the imported data again:

This time, we see that the count becomes 40, which is twice the first 20. This is because we re read the data every 2 minutes. If we wait another two minutes, the data will become 60:

Well, in today's exercise. We described how to use the Logstash RSS input plugin to import RSS feed data. I hope it will be helpful to your work.

Topics: Big Data ElasticSearch elastic