URL data analyzer and extractor. Detect malicious signs and other useful data associated with URLs.
Pekka Helenius 3df3cb660d Code clean-up: remove redundant whitespaces 3 years ago

URL Analyzer

About

This program extracts various website information based on URL addresses. The extracted data can be used to assess whether a given URL is malicious.

Features

NOTE: For sample JSON data, see the file sample_dataset.json.

To summarize, the program performs the following steps for each listed URL:

  • Gets domain registrar

  • Gets webpage title and automatically compares it to the domain registrar name

  • Gets initial and final destination of a given URL

    • Analyzes whether the final destination domain is the same as the initial one
  • Gets URL redirects and HTTP response status codes

  • Fetches WHOIS data

    • Gets domain timestamps such as creation, update and expiration dates
      • Exact dates and days relative to the current date
  • Gets content and number of iframes (for detecting possible cross-site scripting, XSS)

  • Gets URL references on a webpage

    • Local domain referrals
    • External URL referrals
    • Multidot URLs (ones with ../ in the URL path)
      • Gets domain registrars for each URL
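
The iframe and URL reference steps listed above can be illustrated with a minimal sketch. This is not the project's own code: the project lists BeautifulSoup4 as a requirement, while this example swaps in the standard library `html.parser` so that it runs without extra packages. The class name `PageScanner`, its attribute names, and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageScanner(HTMLParser):
    """Collect iframe sources and classify anchor hrefs found in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.base_host = urlparse(base_url).hostname
        self.iframes = []    # src attribute of every <iframe>
        self.local = []      # absolute links pointing at the same host
        self.external = []   # absolute links pointing at another host
        self.multidot = []   # raw hrefs containing "../" in the path

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "iframe":
            self.iframes.append(attrs.get("src"))
        elif tag == "a" and attrs.get("href"):
            href = attrs["href"]
            # Record "multidot" references before the path is normalized.
            if "../" in urlparse(href).path:
                self.multidot.append(href)
            # Resolve relative links against the page URL, then compare hosts.
            absolute = urljoin(self.base_url, href)
            if urlparse(absolute).hostname == self.base_host:
                self.local.append(absolute)
            else:
                self.external.append(absolute)


# Hypothetical page content, used only for illustration.
html = """
<a href="/about">About</a>
<a href="https://other.example.net/page">Elsewhere</a>
<a href="../../etc/passwd">Odd link</a>
<iframe src="https://ads.example.org/frame"></iframe>
"""

scanner = PageScanner("https://example.com/docs/index.html")
scanner.feed(html)
print("iframes:", len(scanner.iframes))
print("local:", scanner.local)
print("external:", scanner.external)
print("multidot:", scanner.multidot)
```

In this sketch a link counts as local only when its hostname matches the page's hostname exactly. Links without a hostname after resolution, such as `mailto:` links, end up in the external list, which a real implementation would probably want to filter out.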

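Two of the smaller checks in the feature list, comparing the initial and final destination domains and expressing WHOIS timestamps relative to the current day, might look roughly like the sketch below. The function names `same_domain` and `days_relative`, and the dates used, are assumptions made for illustration and are not taken from the project's code. No network or WHOIS lookup is performed here.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse


def same_domain(initial_url, final_url):
    """Return True when both URLs have the same, non-empty hostname."""
    initial_host = urlparse(initial_url).hostname or ""
    final_host = urlparse(final_url).hostname or ""
    return initial_host != "" and initial_host == final_host


def days_relative(timestamp, now=None):
    """Whole days from `now` to `timestamp`; negative values lie in the past."""
    now = now or datetime.now(timezone.utc)
    return (timestamp - now).days


# Redirect that stays on the same host versus one that leaves it.
print(same_domain("http://example.com/login", "https://example.com/account"))
print(same_domain("http://example.com/", "https://login.example.net/"))

# Hypothetical WHOIS dates, measured against a fixed "current" day.
today = datetime(2024, 1, 11, tzinfo=timezone.utc)
created = datetime(2024, 1, 1, tzinfo=timezone.utc)
expires = datetime(2024, 1, 21, tzinfo=timezone.utc)
print(days_relative(created, today), days_relative(expires, today))
```

The hostname comparison is exact, so `www.example.com` and `example.com` are treated as different. Comparing registered domains instead would need a public suffix list, which is outside the scope of this sketch.
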
Requirements

Python 3
Python 3 BeautifulSoup4   python-beautifulsoup4
Python 3 whois <= 0.7.3   python-whois; PyPI
Python 3 JSON Schema      python-jsonschema
Python 3 Numpy            python-numpy
Python 3 matplotlib       python-matplotlib

NOTE: Some Linux distributions may use the python3 executable instead of python for Python 3.

Other requirements

  • Jupyter (recommended)
  • Working DNS name resolution
  • Internet connection

Code

Screenshots

The following screenshots are generated with matplotlib

Domains associated with HTML URL data

Known bugs, issues and missing features

  • Non-UTF-8 character decoding not implemented

  • If multiple JSON data files exist, a wrong JSON data file is likely selected

  • Get URLs and other parameters from command line

  • More data visualization and comprehensive analysis

  • Null data may be generated in some cases

  • Add (unit) tests
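
As one possible direction for the non-UTF-8 decoding item above, a fallback chain could try UTF-8 first, then a charset declared by the server, and finally Latin-1. This is only a sketch of a common approach, not something the project implements. The function name `decode_body` and the `declared_charset` parameter are hypothetical.

```python
def decode_body(raw, declared_charset=None):
    """Decode bytes as UTF-8, then the declared charset, then Latin-1."""
    candidates = ["utf-8"]
    if declared_charset:
        candidates.append(declared_charset)
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong encoding or unknown codec name: try the next candidate.
            continue
    # Latin-1 maps every byte to a code point, so this never raises.
    return raw.decode("latin-1"), "latin-1"


text, used = decode_body("käyttäjä".encode("cp1252"), declared_charset="cp1252")
print(text, used)
```

The Latin-1 step guarantees a result but may produce wrong characters, so a caller would likely want to record which encoding was used alongside the decoded text.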

License

N/A