o
    Rŀgz                     @   s   d Z ddlZddlmZ ddlmZ ddlmZ ddlmZ ddlm	Z	 ddlm
Z
 d	Zd
ZeeZee ZdddZdd Zdd Zdd Zdd Zdd ZG dd dZG dd deZG dd dZedkrxddlmZ e  dS dS ) a*  Bio.SearchIO parser for BLAT output formats.

This module adds support for parsing BLAT outputs. BLAT (BLAST-Like Alignment
Tool) is a sequence similarity search program initially built for annotating
the human genome.

Bio.SearchIO.BlastIO was tested using standalone BLAT version 34, psLayout
version 3. It should be able to parse psLayout version 4 without problems.

More information on BLAT is available from these sites:

    - Publication: http://genome.cshlp.org/content/12/4/656
    - User guide: http://genome.ucsc.edu/goldenPath/help/blatSpec.html
    - Source download: http://www.soe.ucsc.edu/~kent/src
    - Executable download: http://hgdownload.cse.ucsc.edu/admin/exe/
    - Blat score calculation: http://genome.ucsc.edu/FAQ/FAQblat.html#blat4


Supported Formats
=================

BlatIO supports parsing, indexing, and writing for both PSL and PSLX output
formats, with or without header. To parse, index, or write PSLX files, use the
'pslx' keyword argument and set it to True.

    >>> # blat-psl defaults to PSL files
    >>> from Bio import SearchIO
    >>> psl = 'Blat/psl_34_004.psl'
    >>> qresult = SearchIO.read(psl, 'blat-psl')
    >>> qresult
    QueryResult(id='hg19_dna', 10 hits)

    >>> # set the pslx flag to parse PSLX files
    >>> pslx = 'Blat/pslx_34_004.pslx'
    >>> qresult = SearchIO.read(pslx, 'blat-psl', pslx=True)
    >>> qresult
    QueryResult(id='hg19_dna', 10 hits)

For parsing and indexing, you do not need to specify whether the file has a
header or not. For writing, if you want to write a header, you can set the
'header' keyword argument to True. This will write a 'psLayout version 3' header
to your output file.

    >>> from Bio import SearchIO
    >>> qresult = SearchIO.read(psl, "blat-psl")
    >>> SearchIO.write(qresult, "example.psl", "blat-psl", header=True)
    (1, 10, 19, 23)
    >>> import os
    >>> os.remove("example.psl")

Note that the number of HSPFragments written may exceed the number of HSP
objects. This is because in PSL files, it is possible to have single matches
consisting of noncontiguous sequence fragments. This is where the HSPFragment
object comes into play. These fragments are grouped into a single HSP because
they share the same statistics (e.g. match numbers, BLAT score, etc.). However,
they do not share the same sequence attributes, such as the start and end
coordinates, making them distinct objects.

In addition to parsing PSL(X) files, BlatIO also computes the percent identities
and scores of your search results. This is done using the calculation formula
posted here: http://genome.ucsc.edu/FAQ/FAQblat.html#blat4. It mimics the score
and percent identity calculation done by UCSC's web BLAT service.

Since BlatIO parses the file in a single pass, it expects all results from
the same query to be in consecutive rows. If the results from one query are
spread in nonconsecutive rows, BlatIO will consider them to be separate
QueryResult objects.

In most cases, the PSL(X) format uses the same coordinate system as Python
(zero-based, half open). These coordinates are anchored on the plus strand.
However, if the query aligns on the minus strand, BLAT will anchor the qStarts
coordinates on the minus strand instead. BlatIO is aware of this, and will
re-anchor the qStarts coordinates to the plus strand whenever it sees a minus
strand query match. Conversely, when you write out to a PSL(X) file, BlatIO will
reanchor qStarts to the minus strand again.

BlatIO provides the following attribute-column mapping:

+----------------+-------------------------+-----------------------------------+
| Object         | Attribute               | Column Name, Value                |
+================+=========================+===================================+
| QueryResult    | id                      | Q name, query sequence ID         |
|                +-------------------------+-----------------------------------+
|                | seq_len                 | Q size, query sequence full       |
|                |                         | length                            |
+----------------+-------------------------+-----------------------------------+
| Hit            | id                      | T name, hit sequence ID           |
|                +-------------------------+-----------------------------------+
|                | seq_len                 | T size, hit sequence full length  |
+----------------+-------------------------+-----------------------------------+
| HSP            | hit_end                 | T end, end coordinate of the last |
|                |                         | hit fragment                      |
|                +-------------------------+-----------------------------------+
|                | hit_gap_num             | T gap bases, number of bases      |
|                |                         | inserted in hit                   |
|                +-------------------------+-----------------------------------+
|                | hit_gapopen_num         | T gap count, number of hit gap    |
|                |                         | inserts                           |
|                +-------------------------+-----------------------------------+
|                | hit_span_all            | blockSizes, sizes of each         |
|                |                         | fragment                          |
|                +-------------------------+-----------------------------------+
|                | hit_start               | T start, start coordinate of the  |
|                |                         | first hit fragment                |
|                +-------------------------+-----------------------------------+
|                | hit_start_all           | tStarts, start coordinate of each |
|                |                         | hit fragment                      |
|                +-------------------------+-----------------------------------+
|                | match_num               | match, number of non-repeat       |
|                |                         | matches                           |
|                +-------------------------+-----------------------------------+
|                | mismatch_num            | mismatch, number of mismatches    |
|                +-------------------------+-----------------------------------+
|                | match_rep_num           | rep. match, number of matches     |
|                |                         | that are part of repeats          |
|                +-------------------------+-----------------------------------+
|                | n_num                   | N's, number of N bases            |
|                +-------------------------+-----------------------------------+
|                | query_end               | Q end, end coordinate of the last |
|                +-------------------------+-----------------------------------+
|                |                         | query fragment                    |
|                | query_gap_num           | Q gap bases, number of bases      |
|                |                         | inserted in query                 |
|                +-------------------------+-----------------------------------+
|                | query_gapopen_num       | Q gap count, number of query gap  |
|                |                         | inserts                           |
|                +-------------------------+-----------------------------------+
|                | query_span_all          | blockSizes, sizes of each         |
|                |                         | fragment                          |
|                +-------------------------+-----------------------------------+
|                | query_start             | Q start, start coordinate of the  |
|                |                         | first query block                 |
|                +-------------------------+-----------------------------------+
|                | query_start_all         | qStarts, start coordinate of each |
|                |                         | query fragment                    |
|                +-------------------------+-----------------------------------+
|                | len [#]_                | block count, the number of blocks |
|                |                         | in the alignment                  |
+----------------+-------------------------+-----------------------------------+
| HSPFragment    | hit                     | hit sequence, if present          |
|                +-------------------------+-----------------------------------+
|                | hit_strand              | strand, hit sequence strand       |
|                +-------------------------+-----------------------------------+
|                | query                   | query sequence, if present        |
|                +-------------------------+-----------------------------------+
|                | query_strand            | strand, query sequence strand     |
+----------------+-------------------------+-----------------------------------+

In addition to the column mappings above, BlatIO also provides the following
object attributes:

+----------------+-------------------------+-----------------------------------+
| Object         | Attribute               | Value                             |
+================+=========================+===================================+
| HSP            | gapopen_num             | Q gap count + T gap count, total  |
|                |                         |  number of gap openings           |
|                +-------------------------+-----------------------------------+
|                | ident_num               | matches + repmatches, total       |
|                |                         | number of identical residues      |
|                +-------------------------+-----------------------------------+
|                | ident_pct               | percent identity, calculated      |
|                |                         | using UCSC's formula              |
|                +-------------------------+-----------------------------------+
|                | query_is_protein        | boolean, whether the query        |
|                |                         | sequence is a protein             |
|                +-------------------------+-----------------------------------+
|                | score                   | HSP score, calculated using       |
|                |                         | UCSC's formula                    |
+----------------+-------------------------+-----------------------------------+

Finally, the default HSP and HSPFragment properties are also provided. See the
HSP and HSPFragment documentation for more details on these properties.


.. [#] You can obtain the number of blocks / fragments in the HSP by invoking
   ``len`` on the HSP

    N)log)SearchIndexer)Hit)HSP)HSPFragment)QueryResult)BlatPslParserBlatPslIndexerBlatPslWriterz^\d+\s+\d+\s+\d+\s+\d+c                    s4    du rdd |  dD S  fdd|  dD S )aC  Transform the given comma-separated string into a list (PRIVATE).

    :param csv_string: comma-separated input string
    :type csv_string: string
    :param caster: function used to cast each item in the input string
                   to its intended type
    :type caster: callable, accepts string, returns object

    Nc                 S      g | ]}|r|qS  r   .0xr   r   G/var/www/html/myenv/lib/python3.10/site-packages/Bio/SearchIO/BlatIO.py
<listcomp>       z"_list_from_csv.<locals>.<listcomp>,c                    s   g | ]}|r |qS r   r   r   casterr   r   r          )split)
csv_stringr   r   r   r   _list_from_csv   s   
r   c                    sL   t | t |krtdt | t |f |dkr| S  fddt| |D S )aZ  Reorients block starts into the opposite strand's coordinates (PRIVATE).

    :param starts: start coordinates
    :type starts: list [int]
    :param blksizes: block sizes
    :type blksizes: list [int]
    :param seqlen: sequence length
    :type seqlen: int
    :param strand: sequence strand
    :type strand: int, choice of -1, 0, or 1

    z9Unequal start coordinates and block sizes list (%r vs %r)r   c                    s   g | ]
\}} | | qS r   r   )r   startblksizeseqlenr   r   r          z$_reorient_starts.<locals>.<listcomp>)lenRuntimeErrorzip)startsblksizesr   strandr   r   r   _reorient_starts   s   r%   c                 C   s   t | d dkr@| d d dkr"| d | d d d| d	 d   kS | d d d
kr@| d | d | d d d| d	 d    kS dS )z%Validate if psl is protein (PRIVATE).r$         +tendtstarts   
blocksizes-tstarttsizeFr   )pslr   r   r   _is_protein   s   $r3   c           	      C   s   |rdnd}d}|| d | d   }| d | d  }t ||}|dkr%dS || }|dk r/dn|}|| d | d	  | d
   }|dkr[d| d
 | | d  tdtd|    | }|S )zCalculate millibad (PRIVATE).r,   r'   r   qendqstartr)   r/   matches
repmatches
mismatchesi  
qnuminsert)minroundr   )	r2   
is_proteinsize_mulmillibad	qali_size	tali_sizeali_sizesize_diftotalr   r   r   _calc_millibad  s*   

	rD   c                 C   s@   |rdnd}|| d | d d?   || d   | d  | d  S )zCalculate score (PRIVATE).r,   r'   r6   r7   r8   r9   
tnuminsertr   )r2   r<   r=   r   r   r   _calc_score#  s   
rF   c                    s  t |}|r	d}n|d d dkrdnd}z|d d dkr dnd}W n ty.   d}Y nw |r3dnd t|d |d |d	 |}t|d d
kr_t|d  fdd|d D |d |}n|d }t|t|  krvt|d ksyJ  J tt|dd t||d D }tt| fddt||d D }	d|v rd|v rt|d t|d   krt|  krt|	ksJ  J n
t|t|	ksJ g }
t|D ]L\}}|d}|sdn|| }|d}|sdn|| }t| |||d}d|_	|d |_
|d |_|	| d |_|	| d |_||_||_|
| qt|
}|j
|d ks0J |j|d ks:J |j|d ksDJ |j|d ksNJ  fdd|jD }||j  krh|d kskJ  J |d |_|d |_|d |_|d |_|d |_|d  |_|d! |_|d" |_|d |d  |_|d |d!  |_|d  |d"  |_||_d#t ||d$  |_!t"|||_#t|d d
k|_$|S )%z*Create high scoring pair object (PRIVATE).r   r$   r(   r'   r+   r,   qstartsr-   qsizer&   r*   c                       g | ]} | qS r   r   )r   iblocksize_multiplierr   r   r   F  r   z_create_hsp.<locals>.<listcomp>r0   c                 S   s   g | ]\}}|| qS r   r   r   r   yr   r   r   r   P  r   c                    s   g | ]
\}}||   qS r   r   rM   rK   r   r   r   U  r   tseqsqseqs )hitqueryDNAr5   r4   r/   r)   c                    s   g | ]}|  qS r   r   )r   spanrK   r   r   r     r   r6   r8   r7   ncountr9   qbaseinsertrE   tbaseinsertg      Y@g?)%r3   
IndexErrorr%   r   listr!   	enumerategetr   molecule_typequery_start	query_end	hit_starthit_endquery_strand
hit_strandappendr   hit_span_allquery_span_all	match_nummismatch_nummatch_rep_numn_numquery_gapopen_numquery_gap_numhit_gapopen_numhit_gap_num	ident_numgapopen_numgap_numquery_is_proteinrD   	ident_pctrF   score_has_hit_strand)hidqidr2   r<   qstrandhstrandrG   hstartsquery_range_allhit_range_allfragsidxqcoordshseqlisthseqqseqlistqseqfraghsp	hit_spansr   rK   r   _create_hsp/  s   ,





&







r   c                   @   s:   e Zd ZdZdddZdd Zdd Zd	d
 Zdd ZdS )r   zParser for the BLAT PSL format.Fc                 C   s   || _ | j  | _|| _dS Initialize the class.N)handlereadlinelinepslx)selfr   r   r   r   r   __init__  s   
zBlatPslParser.__init__c                 c   sd    | j sdS tt| j  s#| j | _ | j sdS tt| j  r|  D ]}d|_|V  q'dS )z1Iterate over BlatPslParser, yields query results.Nblat)	r   research_RE_ROW_CHECKstripr   r   _parse_qresultprogram)r   qresultr   r   r   __iter__  s   zBlatPslParser.__iter__c                 C   s  | j sJ dd | j  dD }| | i }|d |d< t|d |d< |d |d	< t|d
 |d< t|d |d< t|d |d< t|d |d< t|d |d< t|d |d< t|d |d< t|d |d< t|d |d< |d |d< t|d |d< t|d  |d!< t|d" |d#< t|d$ |d%< t|d& |d'< t|d( t|d)< t|d* t|d+< t|d, t|d-< | jrt|d. |d/< t|d0 |d1< |S )2z6Return a dictionary of parsed column values (PRIVATE).c                 S   r   r   r   r   r   r   r   r     r   z,BlatPslParser._parse_row.<locals>.<listcomp>		   qname
   rH      tname   r0   r   r6   r'   r8   r&   r7   r,   rV      r9      rW      rE      rX      r$      r5      r4      r/      r)      
blockcount   r-      rG      r*      rP      rO   )r   r   r   _validate_colsintr   r   )r   colsr2   r   r   r   
_parse_row  s:   

zBlatPslParser._parse_rowc                 C   sR   | j st|dkrtd| jt|f dS t|dkr'td| jt|f dS )z2Validate column's length of PSL or PSLX (PRIVATE).r   zAInvalid PSL line: %r. Expected 21 tab-separated columns, found %i   zBInvalid PSLX line: %r. Expected 23 tab-separated columns, found %iN)r   r   
ValueErrorr   )r   r   r   r   r   r     s   zBlatPslParser._validate_colsc                 c   sP   d}d}d}d}d}d}d}d\}}	d\}
}d\}}g g }}	 |dur+|}|}
|	}| j r;|  }|d	 }|d
 }	n|}d\}}	|
|krH|}n|}||	ksR||krU|}n|}|durt||
|}|| ||krzt|}|d |_|| g }||ks||krt|
d}|D ]}|| q|d |_|V  ||krdS g }| j	 | _ q!)z$Yield QueryResult objects (PRIVATE).r   r'   r,   r&   r   N)NNTr   r   r0   )idrH   )
r   r   r   rd   r   seq_lenr   absorbr   r   )r   	state_EOFstate_QRES_NEWstate_QRES_SAMEstate_HIT_NEWstate_HIT_SAME
qres_state
file_statecur_qidcur_hidprev_qidprev_hidcurprevhit_listhsp_list	hit_stater   rR   r   r   r   r   r     s`   






zBlatPslParser._parse_qresultNF)	__name__
__module____qualname____doc__r   r   r   r   r   r   r   r   r   r     s    
"r   c                   @   s.   e Zd ZdZeZd
ddZdd Zdd Zd	S )r	   z"Indexer class for BLAT PSL output.Fc                 C   s   t j| ||d dS )r   )r   N)r   r   )r   filenamer   r   r   r   r   .  s   zBlatPslIndexer.__init__c           
      c   s    | j }|d d}d}d}| }| }tt| s3| }| }|s+dS tt| r	 | }dd | |D }|du rM|| }n|| }	|	|krg|	 ||| fV  |	}|t
| }| }|sy|	 ||| fV  dS q4)zCIterate over the file handle; yields key, start offset, and length.r   r   N   	Tc                 S   r   r   r   r   r   r   r   r   I  r   z+BlatPslIndexer.__iter__.<locals>.<listcomp>)_handleseektellr   r   r   _RE_ROW_CHECK_IDXr   r   decoder   )
r   r   query_id_idxqresult_keytab_charstart_offsetr   
end_offsetr   curr_keyr   r   r   r   2  s:   

zBlatPslIndexer.__iter__c           
      C   s~   | j }|| d}d}d}d}	 | }|s	 |S dd | |D }|du r/|| }n|| }	|	|kr:	 |S ||7 }q)zFReturn raw bytes string of a QueryResult object from the given offset.r   N    r   Tc                 S   r   r   r   r   r   r   r   r   f  r   z*BlatPslIndexer.get_raw.<locals>.<listcomp>)r   r   r   r   r   )
r   offsetr   r   r   qresult_rawr   r   r   r   r   r   r   get_rawY  s(   


zBlatPslIndexer.get_rawNr   )	r   r   r   r   r   _parserr   r   r   r   r   r   r   r	   )  s    
'r	   c                   @   s2   e Zd ZdZdddZdd Zdd Zd	d
 ZdS )r
   zWriter for the blat-psl format.Fc                 C   s   || _ || _|| _dS r   )r   headerr   )r   r   r   r   r   r   r   r   u  s   
zBlatPslWriter.__init__c                 C   s   | j }d\}}}}| jr||   |D ],}|rA|| | |d7 }|t|7 }|tdd |D 7 }|tdd |D 7 }q||||fS )zWrite query results to file.)r   r   r   r   r'   c                 s       | ]}t |V  qd S Nr1   r   rR   r   r   r   	<genexpr>      z+BlatPslWriter.write_file.<locals>.<genexpr>c                 s       | ]}t |jV  qd S r   )r   	fragmentsr   r   r   r   r         )r   r   write_build_header
_build_rowr   sum)r   qresultsr   qresult_counterhit_counterhsp_counterfrag_counterr   r   r   r   
write_file|  s   zBlatPslWriter.write_filec                 C   s   d}|dd 7 }|S )z-Build header, tab-separated string (PRIVATE).zpsLayout version 3
z
match	mis- 	rep. 	N's	Q gap	Q gap	T gap	T gap	strand	Q        	Q   	Q    	Q  	T        	T   	T    	T  	block	blockSizes 	qStarts	 tStarts
     	match	match	   	count	bases	count	bases	      	name     	size	start	end	name     	size	start	end	count
%s
z---------------------------------------------------------------------------------------------------------------------------------------------------------------r   )r   r   r   r   r   r     s   	zBlatPslWriter._build_headerc                    s  g }|D ];}|j D ]4}t|dd}|rdnd g }||j ||j ||j ||j ||j ||j ||j	 ||j
  fdd|jD }|j|kr^td|j}|d jdkrkd	}	nd
}	tdd |jD |j|j|d j}
|d jdkrd}|jr|	d	7 }	nd}|	d
7 }	tdd |jD |j|j|}||	 ||j ||j ||j ||j ||j ||j ||j ||j |t| |ddd |D d  |ddd |
D d  |ddd |D d  | jr2|ddd |jD d  |ddd |jD d  |ddd |D  q
qd|d S )zGReturn a string or one row or more of the QueryResult object (PRIVATE).rr   Fr,   r'   c                    rI   r   r   )r   srK   r   r   r     r   z,BlatPslWriter._build_row.<locals>.<listcomp>z0HSP hit span and query span values do not match.r   r(   r.   c                 S      g | ]}|d  qS r   r   r   r   r   r   r     r   r+   c                 S   r   r   r   r   r   r   r   r     r   r   c                 s   r   r   strr   r   r   r   r     r   z+BlatPslWriter._build_row.<locals>.<genexpr>c                 s   r   r   r   r   r   r   r   r     r   c                 s   r   r   r   r   r   r   r   r     r   c                 s   r   r   r   seqr   r   r   r   r     r   c                 s   r   r   r   r   r   r   r   r     r   r   c                 s   r   r   r   r   r   r   r   r     r   
)hspsgetattrrd   rg   rh   ri   rj   rk   rl   rm   rn   rf   re   r   rb   r%   r{   r   rc   ru   r|   r   r^   r_   r`   ra   r   joinr   	query_allhit_all)r   r   qresult_linesrR   r   rr   r   eff_query_spansblock_sizesr$   rG   ry   rz   r   rK   r   r     sv   


  CzBlatPslWriter._build_rowN)FF)r   r   r   r   r   r   r   r   r   r   r   r   r
   r  s    
r
   __main__)run_doctestr   )r   r   mathr   Bio.SearchIO._indexr   Bio.SearchIO._modelr   r   r   r   __all___PTR_ROW_CHECKcompiler   encoder   r   r%   r3   rD   rF   r   r   r	   r
   r   
Bio._utilsr  r   r   r   r   <module>   s6    4

h I~
