# URL Analyzer

URL data analyzer and extractor. Detect malicious signs and other useful data associated with URLs.

## About

This program extracts various website information based on URL addresses. This data can be used to assess whether a given URL is malicious.

### Features

**NOTE**: See sample JSON data: [Get file](sample_dataset.json)

To summarize, the program performs the following steps for each listed URL:

- Gets the domain registrar
- Gets the webpage title and automatically compares it to the domain registrar name
- Gets the initial and final destination of a given URL
- Analyzes whether the final destination domain is the same as the initial one
- Gets URL redirects and HTTP response status codes
- Fetches WHOIS data
- Gets domain timestamps such as creation, update and expiry dates
  - Exact dates and days relative to the current day
- Gets the content and number of iframes (for detecting possible XSS, Cross-Site Scripting)
- Gets URL references on a webpage
  - **Local** domain referrals
  - **External** URL referrals
  - **Multidot** URLs (ones with `../` in the URL path)
  - Gets domain registrars for each URL
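
The grouping of page references into local, external and multidot URLs could be sketched roughly as follows. This is a minimal illustration using only the standard library, not the program's actual code: the function name `classify_references` and the rule that a multidot reference is also counted as local or external are assumptions. Hosts are compared exactly, so a subdomain of the page's domain is treated as external here.

```python
from urllib.parse import urljoin, urlsplit


def classify_references(page_url, references):
    """Group raw href/src values into local, external and multidot lists."""
    page_host = urlsplit(page_url).hostname
    groups = {"local": [], "external": [], "multidot": []}
    for ref in references:
        # "Multidot" is decided on the raw value, before urljoin() would
        # collapse the "../" segments.
        if "../" in ref:
            groups["multidot"].append(ref)
        # Resolve relative references against the page URL, then compare hosts.
        host = urlsplit(urljoin(page_url, ref)).hostname
        if host == page_host:
            groups["local"].append(ref)
        else:
            groups["external"].append(ref)
    return groups


# Hypothetical page URL and references, for illustration only.
result = classify_references(
    "https://example.com/shop/index.html",
    ["/about", "../img/logo.png", "https://cdn.example.net/app.js"],
)
print(result)
```

Resolving each reference with `urljoin` before comparing hosts keeps relative links such as `/about` in the local group.
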

## Requirements

```
Python 3
Python 3    BeautifulSoup4    python-beautifulsoup4
Python 3    whois <= 0.7.3    python-whois; PyPI
Python 3    JSON Schema       python-jsonschema
Python 3    Numpy             python-numpy
Python 3    matplotlib        python-matplotlib
```

**NOTE**: Some Linux distributions may use the `python3` executable instead of `python` for Python 3.

### Other requirements

- Jupyter (recommended)
- Working DNS name resolution
- Internet connection

## Code

- `jupyter notebook (python 3)`: [Get file](code/url-analyzer.ipynb)
- `python 3`: [Get file](code/url-analyzer.py)

## Documents

### Report example

- `URL domain registrar variation analysis`: [Get report](url-analysis-report.pdf)
  - Generated with the command `jupyter nbconvert url-analyzer.ipynb --template hidecode --to pdf`

### Screenshots

The following screenshots are generated with `matplotlib`.

### Domains associated with HTML URL data

![](screenshots/domain_figure_hsfi.png)

![](screenshots/domain_figure_tsfi.png)

**Purpose - WHOIS query lookup**:

- Phishing campaigns register the domains of their websites with the same registrar
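
As a rough illustration of why registrar data is collected, the sketch below counts how many domains each registrar accounts for. The record layout (`domain` and `registrar` keys), the sample values and the function name `registrar_counts` are assumptions made for this example; they are not taken from the program or from `sample_dataset.json`.

```python
from collections import Counter


def registrar_counts(records):
    """Count how many domains each registrar accounts for."""
    # Missing registrar values are grouped under a placeholder label.
    return Counter(rec.get("registrar") or "unknown" for rec in records)


# Hypothetical records; the field names are assumptions for illustration.
records = [
    {"domain": "login-example.test", "registrar": "Registrar A"},
    {"domain": "secure-example.test", "registrar": "Registrar A"},
    {"domain": "example.org", "registrar": "Registrar B"},
    {"domain": "no-whois.test", "registrar": None},
]

counts = registrar_counts(records)
for registrar, count in counts.most_common():
    print(f"{registrar}: {count}")
```

A registrar that dominates the counts for a set of suspicious URLs may be worth a closer look.
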

## Other analysis would reveal more

Other analyses may give better insights, such as:

- **Initial and final URL**:
  - "Even if victim realizes he/she is visiting phishing website, he/she will be likely to report the randomly-generated URL of the visited website, and not that of the redirecting one, which makes blacklisting unable to stop the scam"
  - Phishing URLs may use multiple redirections to avoid blacklist detection
- **Domain timestamps**:
  - Domains are bought for a short period of time (e.g. only one year) to avoid blacklisting
  - Domains are created/updated just before URL creation
- **Domain name & local URL usage consistency**
- **Domain name registration**:
  - "Legitimate websites are likely to register a domain name reflecting the brand or the service they represent."
- **Domain name length**:
  - On phishing websites, the URL tends to be much longer than on legitimate websites. However, the domains themselves (without the TLD) tend to be much shorter
- **URL analysis**:
  - Phishing URLs often contain more dots and subdomains than legitimate URLs
  - "Researchers have observed that more than half of the phishing URLs are shortened to obfuscate the target URL and to hide malignant intentions rather than to gain character space"
- **Robots.txt analysis**:
  - Legitimate robots.txt redirects bots to a legitimate domain rather than to the original phishing domain
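
The length and dot-count observations above could be turned into simple numeric features along these lines. This is a sketch only: the function name `url_lexical_features` and the feature names are hypothetical, and the subdomain count naively assumes a single-label TLD, so it is wrong for suffixes such as `co.uk` and for IP-address hosts.

```python
from urllib.parse import urlsplit


def url_lexical_features(url):
    """Return simple length and dot-count features for a URL."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".") if host else []
    return {
        "url_length": len(url),
        "dot_count": url.count("."),
        # Naive assumption: the last two labels are the registered domain
        # and the TLD, so everything before them is a subdomain.
        "subdomain_count": max(len(labels) - 2, 0),
        # Length of the label directly to the left of the TLD.
        "domain_length": len(labels[-2]) if len(labels) >= 2 else len(host),
    }


# Hypothetical URL, for illustration only.
features = url_lexical_features("http://login.secure.example.com/a/b.php?id=1")
print(features)
```
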

**HTML data keyterms identification**

- Analysis of
  - Starting URL
  - Landing URL
  - Title
  - Text content
  - Copyright marks
  - Number of `iframe`s and `input` fields
  - Reference links (`href`, `src`, etc.)
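
Several of the HTML items listed above (title, `iframe` and `input` counts, reference links) could be collected in a single parsing pass, as sketched below. The requirements list BeautifulSoup4 for the program itself; this example deliberately swaps in the standard library's `html.parser` so that it runs without extra packages. The class name `PageSummary` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collect the title, iframe/input counts and href/src values of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.iframe_count = 0
        self.input_count = 0
        self.references = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "iframe":
            self.iframe_count += 1
        elif tag == "input":
            self.input_count += 1
        # Record every href/src value, whatever tag carries it.
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.references.append(value)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text inside <title> is kept.
        if self._in_title:
            self.title += data


# Hypothetical markup, for illustration only.
sample_html = """
<html><head><title>Sign in</title></head>
<body>
  <form><input name="user"><input name="pass" type="password"></form>
  <iframe src="https://other.example/frame"></iframe>
  <a href="/help">Help</a>
</body></html>
"""

summary = PageSummary()
summary.feed(sample_html)
summary.close()
print(summary.title, summary.iframe_count, summary.input_count, summary.references)
```
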

## Known bugs, issues and missing features

- Non-UTF-8 character decoding is not implemented
- If multiple JSON data files exist, the wrong JSON data file is likely to be selected
- Get URLs and other parameters from the command line and/or an associated `.conf` file
- More data visualization and comprehensive analysis
- Null data may be generated in some cases
- Add (unit) tests
- Improve modularity of the codebase

## License

N/A