Using regex comparisons and other advanced database features for real-time inspection of web server logs

In Storing server logs on CrateDB for fast search and aggregations we saw how we can get server logs sent to CrateDB in real-time, and for demo purposes we set up an instance of MediaWiki.
It was just an example, but it could have been any web server application.
Let’s now imagine that we suspect people are trying to perform SQL injection attacks against our website, we need to keep an eye on the logs.
We have already seen how we can use fulltext search to look for specific error messages, but would it not be great if we could have some rules inspecting the log entries as they come in, extracting relevant information, and flagging anything potentially suspicious?
There are a lot of nice features in CrateDB to support this kind of setup, let me show you an example.
The log entries our web server container is producing look like this: - - [11/Aug/2023:12:59:42 +0000] "GET /favicon.ico HTTP/1.1" 200 852 "" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/ Safari/537.36 Edg/115.0.1901.200"

We see we have the client IP address and the HTTP request that was sent.
Today I want to do 2 checks,

  • I want to query only for requests coming from a specific IP subnet,
  • and I want to see if the HTTP request has anything suspicious that could suggest an attempt to perform a SQL injection attack

I will use generated columns to extract this information from the log entries as they arrive.
CrateDB supports adding columns live to existing tables, but generated columns are special and they can only be added to empty tables, to add our new columns to our systemevents table without any downtime we will use another feature of CrateDB, table swapping.

Let’s create a new table with the 2 special columns:

CREATE TABLE doc.systemevents2 (
	message TEXT
	,INDEX message_ft USING FULLTEXT(message)
	,facility INTEGER
	,fromhost TEXT
	,priority INTEGER
	,DeviceReportedTime TIMESTAMP
	,ReceivedAt TIMESTAMP
	,SysLogTag TEXT	
	,clientip IP GENERATED ALWAYS AS TRY_CAST(btrim(split_part(message,'-',1)) AS IP)
	,suspectedSQLinjection BOOLEAN GENERATED ALWAYS AS message ~* 
				CONCAT('.*SELECT.*FROM.*' , 

Here we are extracting the client IP address from the message text and storing it using the dedicated IP data type in CrateDB. We use the split_part function to look up the string up to the dash symbol, then we use the btrim function to remove spaces from both sides of the string, and finally we use the TRY_CAST function so that log entries that do not have an IP address in this position get a NULL value as clientip but no error message is raised.

We are also using the case insensitive ~* regex comparison operator to look for indications of a possible SQL injection attack attempt, we are looking for occurrences of SELECT .. FROM , UNION ... SELECT , DELETE ... FROM , UPDATE ... SET , ALTER ... TABLE , or attempts to break a string delimiter injecting a single quote character. This will match entries like: - - [11/Aug/2023:13:03:07 +0000] "GET /mw-config/index.php?css=1%27%20WAITFOR%20DELAY%20%270%3A0%3A5%27%20AND%20%27Lshb%27%3D%27Lshb HTTP/1.1" 200 4627 "-" "sqlmap/1.7.8#pip ("

Let’s now swap in this new table and rename the old one as systemevents_archive:

ALTER CLUSTER SWAP TABLE doc.systemevents2 TO doc.systemevents;
GRANT DML ON TABLE doc.systemevents TO rsyslog;
REVOKE DML ON TABLE doc.systemevents2 FROM rsyslog;
ALTER TABLE doc.systemevents2 RENAME TO systemevents_archive;

To show how this work we will need both normal activity, which we can generate just by navigating to http://localhost/ (perhaps from another machine to get a different IP address in the logs), and we will also need some malicious-looking activity, to generate this against our website we can use a well-known pentesting tool sqlmap :

sudo pip install sqlmap
sqlmap -u http://localhost/ --crawl=2

Press ENTER when prompted to proceed with default settings, there will be around 10 questions before the tool concludes there are no obvious vulnerabilities in the website.

We can now run queries like:

FROM systemevents 
WHERE clientip << '' 
ORDER BY devicereportedtime DESC
LIMIT 100;

to get the most recent activity from IP addresses in the subnet, and we can also look for suspicious activity only:

FROM systemevents 
WHERE suspectedSQLinjection
ORDER BY devicereportedtime DESC
LIMIT 100;

This kind of queries could then be integrated into a Grafana dashboard or some alerting system.

I hope you found this interesting. Please do not hesitate to let us know your thoughts in the CrateDB Community.