# Apache log parser

Simple Apache/HTTPD command-line log parser for quick analysis, aimed at web server administration tasks.

Unix-like systems only.
## Motivation

Keep it simple. Very simple.

Although advanced and nice-looking log analytics tools such as [Elastic Stack](https://www.elastic.co/products/) exist (I have used it), I wanted something far simpler and with far less overhead for weekly tasks and for configuring an Apache web server. Therefore, I wrote this simple Python script to parse Apache web server logs.

**Advantages** of this tool are its low overhead, the ability to pipe its output to other Unix tools, and quick log checks. The main idea is to produce the output you need for a short analysis, so that you can properly configure your web server protection mechanisms and network environment based on actual server data.

This tool is not meant for intrusion detection or prevention, and it does not alert administrators about hostile penetration attempts. However, it may reveal simple underlying misconfigurations, such as invalid URL references on your site.
## Requirements

The following Arch Linux packages are required. If you use another distribution, refer to the corresponding packages:

```
python
python-apachelogs
```

[python-apachelogs](https://github.com/jwodder/apachelogs/) is not available in either the Arch Linux repositories or the AUR. Therefore, I provide a PKGBUILD file to install it: [python-apachelogs - PKGBUILD](python-apachelogs/PKGBUILD)

`python-apachelogs` depends on the [python-pydicti](python-apachelogs/python-pydicti/PKGBUILD) package.

Recommended packages for IP address geolocation:

```
geoip
geoip-database
```
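
The usage text further down mentions an external `geoiplookup` tool for the country check. As a rough illustration of what handling its output could involve, here is a minimal sketch that extracts the country name from one output line. It is not the script's actual code: the line format `GeoIP Country Edition: FI, Finland` and the helper name `parse_country` are assumptions made for this example.

```python
from typing import Optional


def parse_country(lookup_line: str) -> Optional[str]:
    """Return the country name from one geoiplookup-style output line.

    Assumes a line such as 'GeoIP Country Edition: FI, Finland'
    (assumed format). Returns None when no 'code, name' pair is present.
    """
    # Split off the edition label in front of the first ": ".
    _, sep, value = lookup_line.partition(": ")
    if not sep:
        return None
    # The remainder is expected to look like 'FI, Finland'.
    _, comma, name = value.partition(", ")
    if not comma or not name.strip():
        return None
    return name.strip()


print(parse_country("GeoIP Country Edition: FI, Finland"))
```

A line without a `code, name` pair, for example a "not found" message, yields `None`, which a caller could map to a placeholder of its choice.
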
## Installation

Arch Linux:

Run `updpkgsums && makepkg -Cfi` in the [apache-logparser](apache-logparser/) directory. This installs the `httpd-logparser` executable in the `/usr/bin/` directory.
## Examples

**Q: Can you list unique connections (IP addresses) together with their country and city location data, using the latest Apache log file?**

```
httpd-logparser --outfields time remote_host country city -d /var/log/httpd/ -f access_log$ -np --stats | sort -k 3 -u | sort -k 4
Processed files: access_log
Matched log entries: 724
Processed log entries: 724
2021-06-06 10:00:57 135.23.195.XXX Canada Quebec
2021-06-06 04:58:58 8.210.233.XXX China Guangzhou
2021-06-06 05:01:37 23.228.109.XXX China Shanghai
2021-06-06 04:49:57 8.210.71.XXX China Unknown: 34.772499, 113.726601
2021-06-06 09:47:32 92.151.100.XXX France Boulogne-Billancourt
2021-06-06 02:05:38 195.154.122.XXX France Ivry-sur-Seine
2021-06-06 03:24:22 92.116.45.XXX Germany Bielefeld
2021-06-06 06:06:58 207.154.218.XXX Germany Frankfurt am Main
2021-06-06 10:45:40 172.105.77.XXX Germany Frankfurt am Main
2021-06-06 00:25:20 92.116.52.XXX Germany Hamm
2021-06-06 05:02:54 159.69.10.XXX Germany Mannheim
2021-06-06 06:24:55 89.246.127.XXX Germany Schloss Holte-Stukenbrock
2021-06-06 10:08:21 138.201.56.XXX Germany Unknown: 51.299301, 9.490900
2021-06-06 03:42:02 47.31.198.XXX India Delhi
2021-06-06 00:15:16 92.118.160.XXX Lithuania Unknown: 56.000000, 24.000000
2021-06-06 02:10:21 92.118.160.XXX Lithuania Unknown: 56.000000, 24.000000
2021-06-06 02:32:48 92.118.160.XXX Lithuania Unknown: 56.000000, 24.000000
2021-06-06 03:26:22 92.118.160.XXX Lithuania Unknown: 56.000000, 24.000000
2021-06-06 06:52:23 92.118.160.XXX Lithuania Unknown: 56.000000, 24.000000
2021-06-06 07:00:48 92.118.160.XXX Lithuania Unknown: 56.000000, 24.000000
2021-06-06 11:10:59 92.118.160.XXX Lithuania Unknown: 56.000000, 24.000000
2021-06-06 00:23:05 92.118.160.XXX Lithuania Unknown: 56.000000, 24.000000
2021-06-06 02:46:33 92.118.160.XXX Lithuania Unknown: 56.000000, 24.000000
2021-06-06 05:11:20 45.131.212.XXX Netherlands Amsterdam
2021-06-06 05:12:40 185.180.143.XXX Portugal Unknown: 38.705700, -9.135900
2021-06-06 07:55:47 89.137.179.XXX Romania Timisoara
2021-06-06 06:10:46 91.243.100.XXX Russian Federation Novocherkassk
2021-06-06 11:30:51 213.177.208.XXX Spain Palencia
2021-06-06 01:41:48 184.22.158.XXX Thailand Thalang
2021-06-06 08:14:41 176.88.78.XXX Turkey Ankara
2021-06-06 08:32:04 212.82.66.XXX United Kingdom Burnham
2021-06-06 03:53:41 45.146.164.XXX United Kingdom London
2021-06-06 04:33:42 185.158.250.XXX United Kingdom Manchester
2021-06-06 10:16:19 82.10.88.XXX United Kingdom Shrewsbury
2021-06-06 10:14:28 40.77.189.XXX United States Chicago
2021-06-06 08:16:07 69.170.221.XXX United States Colorado Springs
2021-06-06 10:57:25 192.241.206.XXX United States San Francisco
2021-06-06 01:09:16 128.14.209.XXX United States Unknown: 37.750999, -97.821999
2021-06-06 06:44:49 47.243.113.XXX United States Unknown: 37.750999, -97.821999
2021-06-06 06:45:48 47.243.116.XXX United States Unknown: 37.750999, -97.821999
2021-06-06 08:00:40 162.244.34.XXX United States Unknown: 37.750999, -97.821999
2021-06-06 10:30:53 47.242.214.XXX United States Unknown: 37.750999, -97.821999
2021-06-06 04:22:27 162.244.33.XXX United States Unknown: 37.750999, -97.821999
2021-06-06 04:34:47 47.243.48.XXX United States Unknown: 37.750999, -97.821999
2021-06-06 06:37:16 47.243.109.XXX United States Unknown: 37.750999, -97.821999
2021-06-06 06:42:37 162.244.33.XXX United States Unknown: 37.750999, -97.821999
2021-06-06 06:44:49 47.243.109.XXX United States Unknown: 37.750999, -97.821999
2021-06-06 07:04:20 47.243.113.XXX United States Unknown: 37.750999, -97.821999
2021-06-06 07:44:23 47.243.110.XXX United States Unknown: 37.750999, -97.821999
2021-06-06 08:29:33 47.242.12.XXX United States Unknown: 37.750999, -97.821999
2021-06-06 10:38:15 128.14.133.XXX United States Unknown: 37.750999, -97.821999
2021-06-06 03:18:25 23.95.132.XXX United States Unknown: 37.750999, -97.821999
2021-06-06 04:13:55 128.1.248.XXX United States Unknown: 37.750999, -97.821999
2021-06-06 08:21:11 64.62.197.XXX United States Unknown: 37.750999, -97.821999
2021-06-06 11:17:33 47.243.95.XXX United States Unknown: 37.750999, -97.821999
2021-06-06 08:03:24 167.56.236.XXX Uruguay Castillos
```

NOTE: The last octet of every IP address is anonymized with the string `XXX`.
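
The `sort -k 3 -u | sort -k 4` part of the pipeline above first keeps one line per distinct "field 3 to end of line" key (IP address plus location), then orders the result by everything from field 4 onward (country and city). The same idea can be sketched in Python, which may help when the output is post-processed in a script instead of a shell. The function names are hypothetical, and plain string comparison stands in for `sort`'s locale-aware ordering.

```python
def key_from(line: str, field: int) -> str:
    """Return the text from whitespace-separated field `field` (1-based) to the end of the line."""
    parts = line.rstrip().split(None, field - 1)
    return parts[field - 1] if len(parts) >= field else ""


def unique_then_sort(lines):
    """Keep the first line per 'field 3 onward' key, then sort by 'field 4 onward'."""
    first_seen = {}
    for line in lines:
        # Like `sort -k 3 -u`: later lines with an identical key are dropped.
        first_seen.setdefault(key_from(line, 3), line.rstrip())
    # Like `sort -k 4`: order by country and city, whole line as tie-breaker.
    return sorted(first_seen.values(), key=lambda l: (key_from(l, 4), l))


sample = [
    "2021-06-06 10:45:40 172.105.77.XXX Germany Frankfurt am Main",
    "2021-06-06 10:00:57 135.23.195.XXX Canada Quebec",
    "2021-06-06 11:00:00 135.23.195.XXX Canada Quebec",  # made-up repeat visit for illustration
    "2021-06-06 03:24:22 92.116.45.XXX Germany Bielefeld",
]
for row in unique_then_sort(sample):
    print(row)
```
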
**Q: How many valid requests from Finland and Sweden occurred between 15th and 24th April 2020?**

```
httpd-logparser --outfields time http_status country -d /var/log/httpd/ -c ^20* -f access_log* -cf Finland Sweden -dl "15-04-2020" -du "24-04-2020" --sortby time --stats
Processing file: access_log
Processing file: access_log.1
Processing file: access_log.2
Processing file: access_log.3
Processing file: access_log.4
Processing log entry: 883
2020-04-17 08:47:05 200 Finland
2020-04-17 08:47:05 200 Finland
2020-04-17 08:47:05 200 Finland
2020-04-17 08:47:05 200 Finland
2020-04-17 08:47:05 200 Finland
2020-04-17 08:47:05 200 Finland
2020-04-17 08:47:05 200 Finland
...
...
2020-04-23 18:04:07 200 Finland
2020-04-23 18:04:07 200 Finland
2020-04-23 18:04:07 200 Finland
2020-04-23 18:04:07 200 Finland
2020-04-23 18:04:07 200 Finland
2020-04-23 18:04:07 200 Finland
2020-04-23 18:04:08 200 Finland
Processed files: access_log, access_log.1, access_log.2, access_log.3, access_log.4
Processed log entries: 883
Matched log entries: 211
```
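
The `-dl` and `-du` options above restrict entries to a day window written in `DD-MM-YYYY` form. A minimal sketch of such a filter follows. It is an illustration only: the helper names are hypothetical, and treating both bounds as inclusive is an assumption, not something the tool's documentation states.

```python
from datetime import date, datetime
from typing import Optional

DAY_FORMAT = "%d-%m-%Y"  # day syntax from the usage text, e.g. 31-12-2020


def parse_day(text: str) -> date:
    """Parse a day such as '15-04-2020' into a date object."""
    return datetime.strptime(text, DAY_FORMAT).date()


def in_day_range(entry_time: datetime,
                 lower: Optional[str] = None,
                 upper: Optional[str] = None) -> bool:
    """True if entry_time falls on a day within [lower, upper].

    Both bounds are treated as inclusive (assumption); a missing
    bound leaves that side of the window open.
    """
    day = entry_time.date()
    if lower is not None and day < parse_day(lower):
        return False
    if upper is not None and day > parse_day(upper):
        return False
    return True


print(in_day_range(datetime(2020, 4, 17, 8, 47, 5), "15-04-2020", "24-04-2020"))
```

Comparing whole days instead of timestamps keeps an entry logged late on the upper day inside the window.
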
**Q: How many redirects have occurred since 1st April 2020?**

```
httpd-logparser --outfields time http_status country -d /var/log/httpd/ -c ^30* -f access_log* -dl "01-04-2020" --sortby time --stats
Processing file: access_log
Processing file: access_log.1
Processing file: access_log.2
Processing file: access_log.3
Processing file: access_log.4
Processing log entry: 8993
2020-04-01 02:13:12 302 United States
2020-04-01 02:13:12 302 United States
2020-04-01 02:13:13 301 United States
2020-04-01 02:13:13 302 United States
2020-04-01 02:13:14 302 United States
2020-04-01 02:13:14 302 United States
2020-04-01 02:13:14 302 United States
2020-04-01 02:13:15 302 United States
2020-04-01 02:13:15 302 United States
2020-04-01 03:25:06 302 United States
2020-04-01 04:03:39 302 Russian Federation
2020-04-01 04:03:44 302 Russian Federation
...
...
2020-05-01 18:53:05 302 Italy
2020-05-01 18:53:21 301 Italy
2020-05-01 18:53:22 301 Italy
2020-05-01 18:53:24 302 Italy
2020-05-01 18:53:25 302 Italy
2020-05-01 18:53:26 302 Italy
2020-05-01 18:53:26 302 Italy
2020-05-01 18:54:20 302 Italy
2020-05-01 19:18:15 301 Russian Federation
2020-05-01 19:18:15 301 Russian Federation
2020-05-01 19:18:15 301 Russian Federation
2020-05-01 19:18:17 301 Russian Federation
2020-05-01 19:21:19 302 France
Processed files: access_log, access_log.1, access_log.2, access_log.3, access_log.4
Processed log entries: 8994
Matched log entries: 3207
```
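
Status codes are selected with regular expressions, so it is worth seeing what a pattern like `^30*` actually accepts. The sketch below uses Python's `re.search` on the status code as text. How the tool itself applies the pattern is an assumption here, and `status_matches` is a hypothetical helper.

```python
import re


def status_matches(status: int, patterns) -> bool:
    """True if the status code, as text, matches any of the regular expressions."""
    text = str(status)
    return any(re.search(pattern, text) is not None for pattern in patterns)


# Compare the pattern from the example with a stricter one.
for code in (200, 301, 302, 311, 404):
    loose = status_matches(code, ["^30*"])   # '3' followed by zero or more '0'
    strict = status_matches(code, ["^30"])   # must start with '30'
    print(code, loose, strict)
```

Under these assumptions `^30*` also accepts a code such as 311, because `0*` may match nothing, while `^30` limits the match to codes starting with `30`. Quoting the pattern in the shell (for example `'^30'`) additionally prevents the `*` from being treated as a filename wildcard.
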
**Q: How many `4XX` status codes have clients connecting from China and the United States produced over all time?**

```
httpd-logparser --outfields time country http_status http_request -d /var/log/httpd/ -c ^4 -f access_log* -cf "United States" China --sortby time --stats
Processing file: access_log
Processing file: access_log.1
Processing file: access_log.2
Processing file: access_log.3
Processing file: access_log.4
Processing log entry: 10221
2020-03-29 18:49:34 United States 408 None
2020-03-29 18:49:34 United States 408 None
2020-03-29 19:28:02 China 408 None
2020-04-08 06:14:48 China 400 GET /phpMyAdmin/scripts/setup.php HTTP/1.1
2020-04-08 06:14:53 China 400 GET /horde/imp/test.php HTTP/1.1
2020-04-08 06:14:54 China 400 GET /login?from=0.000000 HTTP/1.1
...
...
2020-04-24 10:40:16 United States 403 GET /MAPI/API HTTP/1.1
2020-04-24 11:33:16 United States 403 GET /owa/auth/logon.aspx?url=https%3a%2f%2f1%2fecp%2f HTTP/1.1
2020-04-24 13:00:12 United States 403 GET /cgi-bin/luci HTTP/1.1
2020-04-24 13:00:13 United States 403 GET /dana-na/auth/url_default/welcome.cgi HTTP/1.1
2020-04-24 13:00:15 United States 403 GET /remote/login?lang=en HTTP/1.1
2020-04-24 13:00:17 United States 403 GET /index.asp HTTP/1.1
2020-04-24 13:00:18 United States 403 GET /htmlV/welcomeMain.htm HTTP/1.1
2020-04-24 20:08:20 United States 403 GET /dana-na/auth/url_default/welcome.cgi HTTP/1.1
2020-04-24 20:08:22 United States 403 GET /remote/login?lang=en HTTP/1.1
2020-04-25 03:57:39 United States 403 GET /home.asp HTTP/1.1
2020-04-25 03:57:39 United States 403 GET /login.cgi?uri= HTTP/1.1
2020-04-25 03:57:39 United States 403 GET /vpn/index.html HTTP/1.1
2020-04-25 03:57:39 United States 403 GET /cgi-bin/luci HTTP/1.1
2020-04-25 03:57:40 United States 403 GET /dana-na/auth/url_default/welcome.cgi HTTP/1.1
2020-04-25 03:57:40 United States 403 GET /remote/login?lang=en HTTP/1.1
2020-04-25 03:57:40 United States 403 GET /index.asp HTTP/1.1
2020-04-25 03:57:40 United States 403 GET /htmlV/welcomeMain.htm HTTP/1.1
2020-04-25 11:56:32 United States 403 GET /owa/auth/logon.aspx?url=https%3a%2f%2f1%2fecp%2f HTTP/1.1
2020-04-25 21:29:50 United States 403 GET /images/favicon-32x32.png HTTP/1.1
2020-04-25 21:30:08 United States 408 None
Processed files: access_log, access_log.1, access_log.2, access_log.3, access_log.4
Processed log entries: 10222
Matched log entries: 90
```
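
The `-cf` option above includes only the listed countries, and the usage text also describes a negative form, `"\!Country"`, for exclusion. One way such a filter could behave is sketched below. This is a guess at the semantics for illustration, not the script's implementation: `country_allowed` is a hypothetical helper, and stripping a leading backslash before `!` is an assumption about how the argument arrives from the shell.

```python
def country_allowed(country: str, filters) -> bool:
    """Apply include/exclude country filters.

    Entries starting with '!' (optionally preceded by a backslash)
    exclude a country; all other entries form an include list.
    An empty include list admits every country that is not excluded.
    """
    include, exclude = set(), set()
    for item in filters:
        name = item.lstrip("\\")  # drop a leading backslash kept by the shell
        if name.startswith("!"):
            exclude.add(name[1:])
        else:
            include.add(name)
    if country in exclude:
        return False
    return not include or country in include


print(country_allowed("China", ["United States", "China"]))
print(country_allowed("Finland", ["\\!Finland"]))
```
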
**Q: Which user agents have been used by clients over all time?**

```
httpd-logparser --outfields user_agent -d /var/log/httpd/ -f access_log* --noprogress | sort -u
facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
fasthttp
Go-http-client/1.1
HTTP Banner Detection (https://security.ipip.net)
kubectl/v1.12.0 (linux/amd64) kubernetes/0ed3388
libwww-perl/5.833
libwww-perl/6.06
libwww-perl/6.43
Microsoft Office Word 2014
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50728)
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; Win64; x64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; Tablet PC 2.0)
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; InfoPath.2)
...
...
Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0
Mozilla/5.0 (X11; Linux x86_64; rv:73.0) Gecko/20100101 Firefox/73.0
Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0
Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0
Mozilla/5.0 zgrab/0.x
Mozilla/5.0 zgrab/0.x (compatible; Researchscan/t12sns; +http://researchscan.comsys.rwth-aachen.de)
Mozilla/5.0 zgrab/0.x (compatible; Researchscan/t13rl; +http://researchscan.comsys.rwth-aachen.de)
NetSystemsResearch studies the availability of various services across the internet. Our website is netsystemsresearch.com
None
python-requests/1.2.3 CPython/2.7.16 Linux/4.14.165-102.185.amzn1.x86_64
python-requests/2.10.0
python-requests/2.19.1
python-requests/2.22.0
python-requests/2.23.0
python-requests/2.6.0 CPython/2.7.5 Linux/3.10.0-1062.12.1.el7.x86_64
python-requests/2.6.0 CPython/2.7.5 Linux/3.10.0-1062.18.1.el7.x86_64
Python-urllib/3.7
Ruby
Wget/1.19.4 (linux-gnu)
WinHTTP/1.1
```
**Q: What is the time difference between the requests of a single client? Exclude Finland and include only the most recent access_log file.**

```
httpd-logparser --outfields http_status time time_diff country -d /var/log/httpd/ -cf "\!Finland" -f access_log$
200 2020-05-01 18:53:07 +2.0 Italy
200 2020-05-01 18:53:19 +12.0 Italy
200 2020-05-01 18:53:20 +1.0 Italy
200 2020-05-01 18:53:20 0.0 Italy
200 2020-05-01 18:53:21 +1.0 Italy
200 2020-05-01 18:53:20 -1.0 Italy
200 2020-05-01 18:53:21 +1.0 Italy
200 2020-05-01 18:53:21 0.0 Italy
301 2020-05-01 18:53:21 0.0 Italy
301 2020-05-01 18:53:22 +1.0 Italy
200 2020-05-01 18:53:22 0.0 Italy
200 2020-05-01 18:53:22 0.0 Italy
200 2020-05-01 18:53:23 +1.0 Italy
200 2020-05-01 18:53:23 0.0 Italy
302 2020-05-01 18:53:24 +1.0 Italy
200 2020-05-01 18:53:24 0.0 Italy
200 2020-05-01 18:53:25 +1.0 Italy
302 2020-05-01 18:53:25 0.0 Italy
302 2020-05-01 18:53:26 +1.0 Italy
302 2020-05-01 18:53:26 0.0 Italy
200 2020-05-01 18:53:26 0.0 Italy
200 2020-05-01 18:53:27 +1.0 Italy
200 2020-05-01 18:53:32 +5.0 Italy
302 2020-05-01 18:54:20 +48.0 Italy
408 2020-05-01 18:54:40 +20.0 Italy
...
...
200 2020-05-01 22:14:36 NEW_CONN Russian Federation
200 2020-05-01 22:30:40 +964.0 Russian Federation
500 2020-05-01 22:35:01 NEW_CONN Singapore
500 2020-05-01 22:35:06 +5.0 Singapore
500 2020-05-01 22:35:09 +3.0 Singapore
500 2020-05-01 22:35:14 +5.0 Singapore
200 2020-05-01 22:37:47 NEW_CONN Russian Federation
...
...
```
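
The `time_diff` column above shows signed second differences and a `NEW_CONN` marker. One plausible way to produce such a column is sketched below, assuming that each entry is compared with the entry directly before it in the log and that a change of remote host starts a new connection. Both the comparison rule and the helper name `time_diffs` are assumptions for illustration, not the script's confirmed logic.

```python
from datetime import datetime


def time_diffs(entries):
    """Yield (host, time, label) for (host, time) pairs given in log order.

    label is 'NEW_CONN' when the host differs from the previous entry,
    otherwise the signed difference in seconds to the previous entry.
    """
    prev_host, prev_time = None, None
    for host, when in entries:
        if host != prev_host:
            label = "NEW_CONN"
        else:
            seconds = (when - prev_time).total_seconds()
            # Positive values get an explicit '+', zero and negatives print as-is.
            label = f"+{seconds}" if seconds > 0 else str(seconds)
        yield host, when, label
        prev_host, prev_time = host, when


log = [
    ("host-a", datetime(2020, 5, 1, 18, 53, 5)),   # placeholder host names
    ("host-a", datetime(2020, 5, 1, 18, 53, 7)),
    ("host-a", datetime(2020, 5, 1, 18, 53, 7)),
    ("host-a", datetime(2020, 5, 1, 18, 53, 6)),
    ("host-b", datetime(2020, 5, 1, 22, 35, 1)),
]
for host, when, label in time_diffs(log):
    print(when, label, host)
```

Negative values can appear because log lines are not guaranteed to be written in strictly increasing timestamp order, which matches the `-1.0` entry in the output above.
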
## Usage

```
usage: httpd-logparser [-h] -d [LOG_DIR] -f LOG_FILE [LOG_FILE ...] [-s [LOG_SYNTAX]] [-c STATUS_CODE [STATUS_CODE ...]] [-cf COUNTRY [COUNTRY ...]] [-ot [OUT_TIMEFORMAT]] [-of OUT_FIELD [OUT_FIELD ...]] [-ng] [-gd [GEODB]] [-dl [DAY_LOWER]] [-du [DAY_UPPER]]
                       [-sb [SORTBY_FIELD]] [-sbr [SORTBY_FIELD_REVERSE]] [-st] [-np]

optional arguments:
  -h, --help            show this help message and exit
  -d [LOG_DIR], --dir [LOG_DIR]
                        Apache log file directory.
  -f LOG_FILE [LOG_FILE ...], --files LOG_FILE [LOG_FILE ...]
                        Apache log files. Regular expressions supported.
  -s [LOG_SYNTAX], --logsyntax [LOG_SYNTAX]
                        Apache log files syntax, defined as "LogFormat" directive in Apache configuration.
  -c STATUS_CODE [STATUS_CODE ...], --statuscodes STATUS_CODE [STATUS_CODE ...]
                        Print only these status codes. Regular expressions supported.
  -cf COUNTRY [COUNTRY ...], --countryfilter COUNTRY [COUNTRY ...]
                        Include only these countries. Negative match (exclude): "\!Country"
  -ot [OUT_TIMEFORMAT], --outtimeformat [OUT_TIMEFORMAT]
                        Output time format. Default: "%d-%m-%Y %H:%M:%S"
  -of OUT_FIELD [OUT_FIELD ...], --outfields OUT_FIELD [OUT_FIELD ...]
                        Output fields. Default: log_file_name, http_status, remote_host, country, city, time, time_diff, user_agent, http_request
  -ng, --nogeo          Skip country check with external "geoiplookup" tool.
  -gd [GEODB], --geodir [GEODB]
                        Database file directory for "geoiplookup" tool. Default: /usr/share/GeoIP/
  -dl [DAY_LOWER], --daylower [DAY_LOWER]
                        Do not check log entries older than this day. Day syntax: 31-12-2020
  -du [DAY_UPPER], --dayupper [DAY_UPPER]
                        Do not check log entries newer than this day. Day syntax: 31-12-2020
  -sb [SORTBY_FIELD], --sortby [SORTBY_FIELD]
                        Sort by an output field.
  -sbr [SORTBY_FIELD_REVERSE], --sortbyreverse [SORTBY_FIELD_REVERSE]
                        Sort by an output field, reverse order.
  -st, --stats          Show short statistics at the end.
  -np, --noprogress     Do not show progress information.
```
## License

GPLv3.