|
|
- # URL Analyzer
-
- URL data analyzer and extractor. Detect malicious signs and other useful data associated with URLs.
-
- ## About
-
- This program extract various website information based on URL addresses. This data can be used to analyze maliciousness of the given URL.
-
- ### Features
-
- **NOTE**: See sample JSON data: [Get file](sample_dataset.json)
-
- To summarize, the program does the following procedures for listed URLs:
-
- - Gets domain registrar
- - Gets webpage title and automatically compares it to the domain registrar name
- - Gets initial and final destination of a given URL
- - Analyzes whether final destination domain is same than the initial one
- - Gets URL redirects and HTTP response status codes
- - Fetches WHOIS data
- - Gets domain timestamps such as creation, update and expire days
- - Exact days & days relative to the current day
-
- - Gets content and number of iframes (for detecting possible XSS; Cross-Site Scripting)
-
- - Gets URL references on a webpage
- - **Local** domain referrals
- - **External** URL referrals
- - **Multidot** URLs (ones with `../` in the URL path)
- - Gets domain registrars for each URL
-
- ## Requirements
-
- ```
- Python 3
- Python 3 BeautifulSoup4 python-beautifulsoup4
- Python 3 whois <= 0.7.3 python-whois; PyPI
- Python 3 JSON Schema python-jsonschema
- Python 3 Numpy python-numpy
- Python 3 matplotlib python-matplotlib
- ```
-
- **NOTE**: Some Linux distributions may use `python3` executable instead of `python` for Python 3.
-
- ### Other requirements
-
- - Jupyter (recommended)
- - Working DNS name resolution
- - Internet connection
-
- ## Code
-
- - `jupyter notebook (python 3)`: [Get file](code/url-analyzer.ipynb)
-
- - `python 3`: [Get file](code/url-analyzer.py)
-
- ## Screenshots
-
- The following screenshots are generated with `matplotlib`
-
- ### Domains associated with HTML URL data
-
- ![](screenshots/domain_figure_hsfi.png)
-
- ![](screenshots/domain_figure_tsfi.png)
-
- **Purpose - WHOIS query lookup**:
- - Phishing campaigns register domains of websites from the same registrar
-
- ### Other analysis would reveal more
-
- Other analysis may give better insights such as:
-
- - **Initial and final URL**:
- - "Even if victim realizes he/she is visiting phishing website, he/she will be likely to report the randomly-generated URL of the visited website, and not that of the redirecting one, which makes blacklisting unable to stop the scam"
- - Phishing URLs may use multiple redirections to avoid blacklist detection
-
- - **Domain timestamps**:
- - Domains bought for short period of time (i.e. only one year) to avoid blacklisting
- - Domains are created/updated just before URL creation
-
- - **Domain name & local URL usage consistency**
-
- - **Domain name registration**:
- - "Legitimate websites are likely to register a domain name reflecting the brand or the service they represent."
-
- - **Domain name length**:
- - In phishing websites, URL tends to be much longer than legitimate websites. However, domains themselves tend to be much shorter (without TLD)
-
- - **URL analysis**
- - Phishing URLs often contain more number of dots and subdomains than legitimate URLs
- - "Researchers have observed that more than half of the phishing URLs are shortened to obfuscate the target URL and to hide malignant intentions rather than to gain character space"
-
- - **Robots.txt analysis**:
- - Legitimate robots.txt redirects bots to a legitimate domain rather than to the original phishing domain
-
- **HTML data keyterms identification**
- - Analysis of
- - Starting URL
- - Landing URL
- - title
- - text content
- - copyright marks
- - number of `iframe`s and `input` fields
- - Reference links (`href`, `src`, etc.)
-
- ## Known bugs issues and missing features
-
- - Non-UTF-8 character decoding not implemented
-
- - If multiple JSON data files exist, a wrong JSON data file is likely selected
-
- - Get URLs and other parameters from command line and/or associated `.conf` file
-
- - More data visualization and compherensive analysis
-
- - Null data may be generated in some cases
-
- - Add (unit) tests
-
- - Improve modularity of the codebase
-
- ## License
-
- N/A
|