Extracts data from websites. The ATTR attribute determines which part of the page is extracted; normally, it is generated by the Extraction Wizard. The EXTRACT command searches the HTML source code of the website for the nth occurrence of ATTR and extracts everything between the opening (<>) and closing (</>) tag that is named last in the anchor. The anchor must always end with the wildcard *.
If several EXTRACT commands appear in one macro, the results are separated by the string [EXTRACT]. This tag is automatically translated into a line break when you use the SAVEAS TYPE=EXTRACT command.
If complete tables were extracted, adjacent table elements are separated by the string #NEXT# and the ends of table rows are delimited by the string #NEWLINE#. These tags are automatically translated into commas and newlines when you use the SAVEAS TYPE=EXTRACT command.
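The tag translation described above can be illustrated with a short Python sketch. It is not the iMacros implementation, only a plain string replacement following the rules stated here; the function name translate_extract_tags is hypothetical, and any quoting or escaping the real SAVEAS command may perform is not modeled.

```python
def translate_extract_tags(raw: str) -> str:
    """Illustrative only: map the separator tags to the characters
    that SAVEAS TYPE=EXTRACT is described as writing."""
    return (
        raw.replace("[EXTRACT]", "\n")   # boundary between two EXTRACT results
           .replace("#NEWLINE#", "\n")   # end of a table row
           .replace("#NEXT#", ",")       # boundary between adjacent table cells
    )


# A 2x2 table followed by the result of a second EXTRACT command.
raw_result = "1.#NEXT#foo#NEWLINE#2.#NEXT#bar[EXTRACT]Page title"
print(translate_extract_tags(raw_result))
```

With this input, the table becomes two comma-separated lines and the second extraction result starts on a line of its own.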
Extract Hidden Input Fields
To extract a hidden input field, first record an EXTRACT command for a visible field (e.g. the name input field). This gives:
EXTRACT POS=1 TYPE=TXT ATTR=<INPUT*
Then add the name of the hidden field (abc in this example) to the extraction anchor:
EXTRACT POS=1 TYPE=TXT ATTR=<INPUT*abc*
Syntax
EXTRACT POS=[R]n TYPE=(TXT|HREF|TITLE|ALT) ATTR=Anchor*
Parameters
POS
The number of the occurrence of the extraction anchor on the website. If this attribute has the form Rn, the nth occurrence after a previously selected website element is extracted (relative extraction).
TYPE
Type of extraction.
TXT
Plain text extraction; all HTML tags are removed.
HREF
The URL of the page element the extraction anchor points to.
TITLE
The title of the page element the extraction anchor points to.
ALT
The alternative text of an image the extraction anchor points to.
ATTR
The extraction anchor. This attribute decides which part of the website is extracted. The wildcard * can be used. The extraction anchor always ends with an *.
Examples
Suppose the following HTML code is given and you would like to extract the text of the link in the second row, second column (bar):
<table>
<tr>
<td><a href="1.html">1.</a></td>
<td><a href="foo.html">foo</a></td>
</tr>
<tr>
<td><a href="1.html">2.</a></td>
<td><a href="bar.html">bar</a></td>
</tr>
</table>
This can be done by extracting the text inside the fourth TD tag:
EXTRACT POS=4 TYPE=TXT ATTR=<TD>*
Or by extracting the text of the link tag:
EXTRACT POS=1 TYPE=TXT ATTR=<A<SP>HREF="bar.html">*
Or by extracting the text of the next TD tag following the TD tag with the text 2. (relative extraction):
TAG POS=1 TYPE=TD ATTR=TXT:2.
EXTRACT POS=R1 TYPE=TXT ATTR=<TD>*
Or by extracting the fourth link using the wildcard *:
EXTRACT POS=4 TYPE=TXT ATTR=<A<SP>HREF=*>*
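To make the matching rules more concrete, the following Python sketch imitates the absolute (non-relative) TYPE=TXT examples above on the same HTML. It is an approximation under stated assumptions, not how iMacros works internally: the function name extract_txt is hypothetical, the wildcard * is assumed to match only inside a single tag, nested tags of the same name are not handled, and relative positions (POS=Rn) are not modeled.

```python
import re

HTML = """<table>
<tr>
<td><a href="1.html">1.</a></td>
<td><a href="foo.html">foo</a></td>
</tr>
<tr>
<td><a href="1.html">2.</a></td>
<td><a href="bar.html">bar</a></td>
</tr>
</table>"""


def extract_txt(html, pos, anchor):
    """Return the text after the pos-th match of anchor, up to the closing
    tag of the last tag named in the anchor, or None if there is no match."""
    anchor = anchor.replace("<SP>", " ")          # <SP> stands for a space
    if not anchor.endswith("*"):
        raise ValueError("the anchor must end with the wildcard *")
    body = anchor[:-1]                            # drop the trailing wildcard
    tag_names = re.findall(r"<([A-Za-z][A-Za-z0-9]*)", body)
    if not tag_names:
        raise ValueError("the anchor must contain at least one tag")
    tag = tag_names[-1]                           # the tag that is last in the anchor
    # Inner wildcards match any characters that stay inside one tag (assumption).
    pattern = "[^<>]*".join(re.escape(part) for part in body.split("*"))
    matches = list(re.finditer(pattern, html, re.IGNORECASE))
    if pos < 1 or pos > len(matches):
        return None
    start = matches[pos - 1].end()
    closing = re.compile(r"</" + re.escape(tag) + r"\s*>", re.IGNORECASE)
    found = closing.search(html, start)
    end = found.start() if found else len(html)
    # TYPE=TXT: remove all remaining HTML tags from the extracted span.
    return re.sub(r"<[^>]+>", "", html[start:end]).strip()


print(extract_txt(HTML, 4, "<TD>*"))
print(extract_txt(HTML, 1, '<A<SP>HREF="bar.html">*'))
print(extract_txt(HTML, 4, "<A<SP>HREF=*>*"))
```

All three calls select the same link text, mirroring the three absolute extraction examples above.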
Tips and tricks can also be found here.
See Also
TAG, SAVEAS
Page URL http://www.iopus.com/imacros/help/cmd_extract.htm